
Add ChaisemartinDHaultfoeuille (dCDH) DID_M estimator (Phase 1) #290

Merged
igerber merged 22 commits into main from dcdh-estimator
Apr 12, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 11, 2026

Summary

  • Adds ChaisemartinDHaultfoeuille (alias DCDH) — Phase 1 of the de Chaisemartin-D'Haultfœuille estimator family, and the only modern staggered DiD estimator in the library that handles non-absorbing (reversible) treatments: treatment can switch on AND off over time (marketing campaigns, seasonal promotions, on/off policy cycles, binary fuzzy designs).
  • Implements DID_M from de Chaisemartin & D'Haultfœuille (2020) AER, equivalently DID_1 (horizon l = 1) of the dynamic companion paper (NBER WP 29873).
  • Forward-compatible API: Phase 2 (multi-horizon event study via aggregate/L_max) and Phase 3 (covariate adjustment via controls, group-specific linear trends, HonestDiD) parameters present from day one with NotImplementedError gates pointing at the relevant phase.
  • Ships joiners-only (DID_+) and leavers-only (DID_-) decompositions, a single-lag placebo (DID_M^pl) point estimate, an optional multiplier bootstrap (Rademacher / Mammen / Webb), a TWFE decomposition diagnostic from Theorem 1 of AER 2020 (per-cell weights, fraction negative, sigma_fe, beta_fe), plus a standalone twowayfeweights() helper.
  • Adds generate_reversible_did_data() generator with 7 patterns (single_switch default, joiners_only, leavers_only, mixed_single_switch, random, cycles, marketing).
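For intuition, the headline comparison can be sketched in a few lines of plain Python (toy data and a hypothetical helper, not the library's API): the per-period joiner term contrasts outcome changes of cells switching 0 → 1 at t against cells staying at 0 over the same pair of periods.

```python
# Toy balanced panel, 3 periods: treatment paths and outcomes per group.
D = {"a": [0, 1, 1], "b": [0, 0, 0], "c": [1, 1, 0], "d": [1, 1, 1]}
Y = {"a": [1.0, 3.0, 3.5], "b": [1.0, 1.5, 2.0], "c": [2.0, 2.5, 1.0], "d": [2.0, 2.5, 3.0]}

def did_plus_t(t):
    """DID_{+,t}: mean outcome change of joiners (0 -> 1 at t) minus
    mean outcome change of stable-0 cells. None when not identified."""
    joiners = [g for g in D if D[g][t - 1] == 0 and D[g][t] == 1]
    stable0 = [g for g in D if D[g][t - 1] == 0 and D[g][t] == 0]
    if not joiners or not stable0:
        return None
    dy = lambda gs: sum(Y[g][t] - Y[g][t - 1] for g in gs) / len(gs)
    return dy(joiners) - dy(stable0)

print(did_plus_t(1))  # 1.5: group "a" joins at t=1, "b" is the stable-0 control
```

Group "c" is a leaver at t = 2, so the symmetric DID_{-,t} term (not shown) would pick it up there.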

Methodology references

  • Method: dCDH DID_M estimator (= DID_1 at horizon l = 1 of the dynamic paper)
  • Sources:
    • de Chaisemartin, C. & D'Haultfœuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. American Economic Review, 110(9), 2964–2996. https://doi.org/10.1257/aer.20181169
    • de Chaisemartin, C. & D'Haultfœuille, X. (2022, revised 2024). Difference-in-Differences Estimators of Intertemporal Treatment Effects. NBER Working Paper 29873. https://www.nber.org/papers/w29873
    • Three paper reviews under docs/methodology/papers/chaisemartin-dhaultfoeuille-{2020-review,2020-appendix-review,dynamic-review}.md (committed in earlier work; the dynamic-paper review is the source for the cohort-recentered analytical variance from Web Appendix Section 3.7.3)
  • Reference implementation parity: R DIDmultiplegtDYN v2.3.3 (CRAN, maintained by the paper authors)

Intentional deviations from R DIDmultiplegtDYN (documented as **Note (deviation from R):** in REGISTRY.md)

Stable-control set definition. Python uses period-based stable definitions (stable_0(t) = any cell with D_{g,t-1} = D_{g,t} = 0, regardless of baseline D_{g,1}), matching the AER 2020 Theorem 3 cell-count notation N_{0,0,t} and N_{1,1,t} literally. R uses cohort-based definitions that additionally require D_{g,1} to match the side, matching the dynamic companion paper's cohort (D_{g,1}, F_g, S_g) framework. The two definitions agree exactly under any of: (a) the panel is pure-direction (joiners-only or leavers-only), (b) no joiner's post-switch state ever overlaps a period when leavers are switching (and vice versa), or (c) the worked example. They diverge by O(1%) on mixed-direction panels with both joiners and leavers when some joiners' post-switch cells could serve as leavers' controls (or vice versa). Parity tests use 1e-4 tolerance for pure-direction scenarios and 2.5% tolerance for mixed-direction scenarios, with the deviation explicitly documented in docs/methodology/REGISTRY.md.
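The two definitions can be sketched on a toy mixed-direction panel (hypothetical structure, not the library's internals): period-based stable_0(t) ignores the baseline treatment D_{g,1}, while the cohort-based set additionally requires D_{g,1} = 0, so a leaver's post-switch cells qualify under the first but not the second.

```python
# Treatment paths over 4 periods. "lv" is a leaver: baseline 1, then 0 from period 1 on.
D = {"jn": [0, 1, 1, 1], "nv": [0, 0, 0, 0], "lv": [1, 0, 0, 0]}

def stable_0_period_based(t):
    # Any cell with D_{g,t-1} = D_{g,t} = 0, regardless of baseline D_{g,1}.
    return {g for g in D if D[g][t - 1] == 0 and D[g][t] == 0}

def stable_0_cohort_based(t):
    # Additionally require the baseline treatment D_{g,1} to be 0.
    return {g for g in D if D[g][0] == 0 and D[g][t - 1] == 0 and D[g][t] == 0}

print(sorted(stable_0_period_based(2)))  # ['lv', 'nv']: the leaver's post-switch cells qualify
print(sorted(stable_0_cohort_based(2)))  # ['nv'] only
```

On a joiners-only or leavers-only panel no such "lv" group exists on the wrong side, which is why the two definitions coincide exactly there.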

Phase 1 placebo SE. Intentionally NaN with a UserWarning. The dynamic companion paper Section 3.7.3 derives the cohort-recentered analytical variance for DID_l only — not for the placebo DID_M^pl. Phase 2 will add multiplier-bootstrap support for the placebo via the dynamic paper's machinery; until then, placebo_se, placebo_t_stat, placebo_p_value, and placebo_conf_int remain NaN even when n_bootstrap > 0. The placebo point estimate is still computed and exposed via results.placebo_effect. A regression test (test_placebo_bootstrap_unavailable_in_phase_1) pins this contract.

Assumption 11 zero-retention convention. When some period t has joiners but no stable_0 controls (or leavers but no stable_1 controls), Python sets that period's DID_{+,t} (or DID_{-,t}) to zero but retains the switcher count in the N_S denominator, matching the AER 2020 Theorem 3 worked-example arithmetic. R drops the cohort entirely. Per-period flags (did_plus_t_a11_zeroed, did_minus_t_a11_zeroed) and a consolidated UserWarning make this visible to users. The R parity scenarios add never-treated and always-treated control cohorts so neither implementation triggers the convention, eliminating this as a source of disagreement in the parity tests.
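The arithmetic difference between the two conventions is easy to see on made-up numbers (illustrative values only; nothing here comes from the actual worked example):

```python
# Hypothetical per-period joiner effects and switcher-cell counts.
# Period 2 has joiners but no stable-0 controls, so DID_{+,2} is not identified.
did_by_period = {1: 2.0, 2: None, 3: 4.0}   # None = no stable controls at t=2
n_switchers_t = {1: 3, 2: 2, 3: 5}

# Python / AER Theorem 3 convention: zero the period but keep its switchers in N_S.
N_S = sum(n_switchers_t.values())  # 10
did_zero_retain = sum(
    n_switchers_t[t] * (0.0 if did_by_period[t] is None else did_by_period[t])
    for t in did_by_period
) / N_S

# R convention: drop the period's switchers from the denominator entirely.
kept = [t for t in did_by_period if did_by_period[t] is not None]
N_S_drop = sum(n_switchers_t[t] for t in kept)  # 8
did_drop = sum(n_switchers_t[t] * did_by_period[t] for t in kept) / N_S_drop

print(did_zero_retain, did_drop)  # 2.6 3.25
```

Adding never-treated or always-treated controls, as the parity scenarios do, ensures every period has a stable side, so neither branch fires.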

Documented matches with R (worth noting because they're substantive choices)

  • drop_larger_lower=True default — drops multi-switch groups before estimation (matches R; required for the analytical variance to be consistent with the point estimate)
  • Singleton-baseline filter (footnote 15 of dynamic paper) — drops groups whose D_{g,1} is unique
  • Cohort-recentered analytical variance from Web Appendix Section 3.7.3 of the dynamic paper (subtracts cohort-conditional means, not a grand mean — the load-bearing variance correctness check that has its own dedicated test)
  • Conservative CI under Assumption 8 (independent groups), exact only under iid sampling — documented as a **Note:** block in REGISTRY.md
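The cohort-recentering point can be illustrated with toy numbers (hypothetical contributions, not the library's internals): subtracting cohort-conditional means removes between-cohort spread that a grand mean would wrongly fold into the variance.

```python
# Hypothetical per-group variance contributions U_g and their cohorts.
U = {"g1": 1.0, "g2": 3.0, "g3": 10.0, "g4": 14.0}
cohort = {"g1": "early", "g2": "early", "g3": "late", "g4": "late"}
G = len(U)

def sum_sq(residuals):
    # (1/G) * sum of squared recentred contributions (illustrative scaling only).
    return sum(r * r for r in residuals) / G

# Grand-mean recentering (the bug the dedicated test is meant to catch).
grand = sum(U.values()) / G
v_grand = sum_sq([U[g] - grand for g in U])

# Cohort-conditional recentering (in the spirit of Web Appendix Section 3.7.3).
means = {c: sum(U[g] for g in U if cohort[g] == c) / sum(1 for g in U if cohort[g] == c)
         for c in set(cohort.values())}
v_cohort = sum_sq([U[g] - means[cohort[g]] for g in U])

print(v_grand, v_cohort)  # 27.5 2.5: cohort recentering strips between-cohort spread
```

A test that only checks overall_se > 0 cannot distinguish these two; computing both on the same data and asserting they differ (as the later commit does) can.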

No silent failures

Every drop / round / fallback emits an explicit warnings.warn() or ValueError. Five distinct warnings/errors fire on the core code paths (NaN treatment validation, NaN outcome validation, placebo not-computable, TWFE diagnostic denominator == 0, never-switching groups filter in the variance), plus the multi-switch filter, the singleton-baseline filter, and the consolidated A11 warning.

Validation

Test files added

  • tests/test_chaisemartin_dhaultfoeuille.py — 44 tests covering basic API, forward-compat gates (one per Phase 2/3 parameter), drop_larger_lower filter, A11 zero-retention, NaN handling, bootstrap (Rademacher / Mammen / Webb), results dataclass, and the new test_placebo_bootstrap_unavailable_in_phase_1 regression test
  • tests/test_methodology_chaisemartin_dhaultfoeuille.py — 8 fast + 2 slow Tier 2 methodology tests including the hand-calculable 4-group worked example (asserts DID_M = 2.5, DID_+ = 2.0, DID_- = 3.0 exactly), test_cohort_recentering_not_grand_mean (the load-bearing variance correctness check that catches the Add initial diff-diff library implementation #1 implementation bug), TWFE diagnostic correctness, and large-N recovery (slow)
  • tests/test_chaisemartin_dhaultfoeuille_parity.py — 5 R DIDmultiplegtDYN parity scenarios at horizon l = 1: single_switch_mixed, joiners_only, leavers_only, mixed_single_switch, and the hand-calculable worked example
  • tests/test_prep_dgp_reversible.py — 38 tests for the new generate_reversible_did_data() generator (shape, patterns, reproducibility, true-effect column, switcher-type column, validation errors)

Test files modified

  • tests/conftest.py — added _check_r_dcdh_available(), r_dcdh_available session fixture, and require_r_dcdh per-test fixture parallel to the existing r_available / require_r pattern, with DIFF_DIFF_R=skip env-var override

R parity reference

  • benchmarks/R/generate_dcdh_dynr_test_values.R — mirror generator + 5 scenarios, generates golden values via did_multiplegt_dyn(..., effects=1) from R DIDmultiplegtDYN v2.3.3
  • benchmarks/data/dcdh_dynr_golden_values.json — committed golden values (~73 KB)
  • The R script also exercises a polars r-universe backend that DIDmultiplegtDYN >= 2.x requires
  • The Python parity test fixture skips cleanly when R or DIDmultiplegtDYN is unavailable, so the test suite still runs on dev machines without R installed
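The availability gate can be sketched roughly like this (hypothetical logic, not the actual conftest.py helper): honor the DIFF_DIFF_R=skip override first, then look for Rscript on PATH.

```python
import shutil

def r_dcdh_available(env_override=None):
    # Hypothetical sketch of the availability check, not the real conftest helper.
    # An explicit DIFF_DIFF_R=skip override wins; otherwise Rscript must be on PATH.
    # (The real fixture presumably also verifies that DIDmultiplegtDYN loads.)
    if env_override == "skip":
        return False
    return shutil.which("Rscript") is not None

print(r_dcdh_available(env_override="skip"))  # False: explicit opt-out
```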

Documentation surface

Per docs/doc-deps.yaml for the new chaisemartin_dhaultfoeuille group:

  • docs/methodology/REGISTRY.md — full ## ChaisemartinDHaultfoeuille section with assumption checks, estimator equations, the four **Note:** blocks (conservative CI, drop_larger_lower, A11 zero-retention, singleton-baseline filter), the **Note (deviation from R DIDmultiplegtDYN):** block, edge cases, requirements checklist
  • docs/api/chaisemartin_dhaultfoeuille.rst — autoclass directives + autofunction for chaisemartin_dhaultfoeuille() and twowayfeweights(), with usage examples
  • docs/api/index.rst — toctree entry; docs/api/prep.rst — generate_reversible_did_data autofunction
  • docs/choosing_estimator.rst — Quick Reference row, decision-flowchart question, Survey Compatibility Matrix row (-- in all four columns), Detailed Guidance subsection
  • docs/practitioner_decision_tree.rst — new "Reversible Treatment (On/Off Cycles)" section with business-language framing
  • docs/llms.txt and docs/llms-full.txt — bullet + ~100-line section
  • README.md — features bullet, alias table entry, full usage section
  • CHANGELOG.md — ### Added entry under ## [Unreleased]
  • ROADMAP.md — Phase 1 items (1a–1i) flipped from "Not started" to "Shipped"; Phase 2 / Phase 3 retained

Test count

96 dCDH tests passing locally (44 API + 8 fast methodology + 2 slow methodology [deselected by default] + 5 R parity + 38 generator, minus overlap = 95 deselectable, + 1 placebo regression = 96 total). The hand-calculable 4-group worked example produces DID_M = 2.5, exactly matching R DIDmultiplegtDYN.
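Those worked-example numbers are internally consistent under the Theorem 3 weighting: DID_M is the switcher-count-weighted average of the joiner and leaver components, so equal joiner and leaver cell counts (assumed here purely for illustration; the actual counts live in the test file) reproduce 2.5 from 2.0 and 3.0.

```python
did_plus, did_minus = 2.0, 3.0
n_joiner_cells, n_leaver_cells = 1, 1   # equal weights assumed for illustration
N_S = n_joiner_cells + n_leaver_cells
did_m = (n_joiner_cells * did_plus + n_leaver_cells * did_minus) / N_S
print(did_m)  # 2.5
```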

No tutorial in this PR

Per the multi-phase plan, a single comprehensive dCDH tutorial ships in Phase 3 after covariate support is added. Phase 1 deliberately ships no tutorial.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

igerber and others added 3 commits April 11, 2026 15:50
Phase 1 of the de Chaisemartin-D'Haultfoeuille estimator family — the only
modern staggered DiD estimator in the library that handles non-absorbing
(reversible) treatments. Treatment can switch on AND off over time
(marketing campaigns, seasonal promotions, on/off policy cycles).

Implements DID_M from de Chaisemartin & D'Haultfoeuille (2020) AER,
equivalently DID_1 (horizon l=1) of the dynamic companion paper
(NBER WP 29873).

Ships:
- Headline DID_M point estimate with cohort-recentered analytical SE
  from Web Appendix Section 3.7.3 of the dynamic companion paper
- Joiners-only (DID_+) and leavers-only (DID_-) decompositions
- Single-lag placebo DID_M^pl (Theorem 4 of AER 2020); analytical
  placebo SE deferred to Phase 2 (NaN with explicit warning)
- Optional multiplier bootstrap clustered at group level
  (Rademacher / Mammen / Webb)
- TWFE decomposition diagnostic from Theorem 1 of AER 2020 (per-cell
  weights, fraction negative, sigma_fe, beta_fe), exposed both as
  estimator field and standalone twowayfeweights() helper
- Multi-switch group filtering (drop_larger_lower=True default,
  matches R DIDmultiplegtDYN); singleton-baseline filter (footnote 15
  of dynamic paper); never-switching groups filter from variance —
  all with explicit warnings (no silent failures)
- Forward-compatible fit() signature: Phase 2 (multi-horizon event
  study, aggregate, L_max) and Phase 3 (covariate adjustment via
  controls, group-specific linear trends, HonestDiD) parameters
  raise NotImplementedError with phase pointers
- generate_reversible_did_data() generator with 7 patterns (single_switch,
  joiners_only, leavers_only, mixed_single_switch, random, cycles, marketing)
- Validated against R DIDmultiplegtDYN v2.3.3 at horizon l=1 via
  tests/test_chaisemartin_dhaultfoeuille_parity.py (5 scenarios)

Methodology notes documented in REGISTRY.md:
- Conservative CI under Assumption 8 (independent groups), exact only
  under iid sampling
- A11 zero-retention convention (paper-faithful, with consolidated warning)
- drop_larger_lower default and inconsistency warning when False
- Singleton-baseline filter (footnote 15)
- Phase 1 placebo SE intentionally NaN (deferred to Phase 2)
- Note (deviation from R DIDmultiplegtDYN): Python uses period-based
  stable-control sets matching AER 2020 Theorem 3 literally; R uses
  cohort-based controls from the dynamic paper. Agree exactly on
  pure-direction panels and the worked example; differ by ~1% on
  mixed-direction panels with both joiners and leavers.

Tests: 95 passing locally (44 API, 8 fast methodology + 2 slow Tier 2,
38 generator, 5 R parity). Hand-calculable 4-group worked example
produces DID_M = 2.5 exactly matching R.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s API

Methodology / inference:
- Bootstrap-updated overall/joiners/leavers inference now goes through
  safe_inference() instead of inline t_stat = effect / se. Matches the
  project anti-pattern rule and keeps NaN behavior consistent with the
  rest of the library.

Placebo doc/code alignment:
- Phase 1 placebo bootstrap is deferred to Phase 2 (the dynamic companion
  paper Section 3.7.3 derives the cohort-recentered analytical variance
  for DID_l only; the placebo's influence-function representation is not
  yet implemented). The bootstrap path covers DID_M, DID_+, and DID_-
  only.
- Updated the source warning text, the placebo_inputs comment, README,
  CHANGELOG, REGISTRY, choosing_estimator.rst, llms-full.txt to drop
  the misleading "set n_bootstrap > 0 for a placebo SE today" promise
  and state clearly that placebo inference fields stay NaN even with
  bootstrap enabled.
- Added test_placebo_bootstrap_unavailable_in_phase_1 to pin this
  contract so future contributors do not silently widen the bootstrap
  surface without wiring up the documented Phase 2 derivation.

twowayfeweights() public API:
- Added ValueError validation for missing outcome/group/time/treatment
  columns (was treatment-only).
- Updated README, docs/api/chaisemartin_dhaultfoeuille.rst, and
  docs/llms-full.txt examples to pass outcome="outcome" so the
  documented usage actually runs.

set_params():
- Re-runs the same __init__ validation rules so invalid parameter
  combinations cannot slip in after construction.

Tests:
- 96 passing locally (was 95). The new placebo test runs alongside
  TestBootstrap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Final doc inconsistency caught on re-review: practitioner_decision_tree.rst
still said "set n_bootstrap > 0 for a bootstrap-based placebo SE today" while
README, REGISTRY, choosing_estimator, and llms-full had been updated to
reflect that placebo bootstrap is deferred to Phase 2 and placebo inference
fields stay NaN even when n_bootstrap > 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fc881fb78d806e089150f0eba1d584886f845a5e


Overall Assessment

⚠️ Needs changes

This PR adds a new estimator and touches methodology-critical code paths, but I found three unmitigated P1 issues in the estimation/inference logic. Static review only: I was not able to run the new test suite in this shell because the local Python runtime here is missing project deps such as pandas.

Executive Summary

  • The Phase 1 DID_M / DID_M^pl implementation is using per-cell observation counts n_gt as Theorem 3/4 weights, while the new Methodology Registry defines N_{a,b,t} as counts of (g,t) cells. That silently changes the estimand on individual-level inputs with uneven cell sizes.
  • The analytical SE path does not implement the registry’s full U^G_g = (Lambda^G_{g,l=1} * Y_g).sum() construction; it explicitly substitutes a first-switch-only “Phase 1 simplification,” and the bootstrap reuses that same vector.
  • The singleton-baseline filter is applied to the estimation sample, although the registry documents it as a variance-stage filter. Under the documented period-based stable-control deviation, that can change DID_M, not just the variance.
  • The new methodology-reference files linked from the roadmap/doc-deps are not present in the repo, and the added tests do not actually validate SE parity or the claimed “cohort recentering vs grand mean” distinction.
  • I did not find additional security issues in the changed files.

Methodology

Code Quality

No additional material findings beyond the methodology issues above.

Performance

No material findings in the changed performance-sensitive paths beyond the correctness issues above.

Maintainability

No additional material findings beyond the documentation/test gaps below.

Tech Debt

No mitigating TODO.md entry covers the P1 issues above, so they are not deferrable under the project’s review policy.

Security

No material security findings in the changed files.

Documentation/Tests

  • Severity: P2. Impact: the PR adds methodology-reference links that do not resolve in this repo, and the new tests do not actually validate the claimed inference correctness. ROADMAP.md and docs/doc-deps.yaml now point reviewers to three docs/methodology/papers/chaisemartin-dhaultfoeuille-*.md files, but those files are not present in this checkout. Separately, the supposed “load-bearing” recentering test only asserts that overall_se > 0, and the R parity suite checks point estimates only, not SEs. That leaves the incorrect SE path above effectively unguarded. Concrete fix: commit the referenced methodology-review files, make doc-deps/path resolution fail CI when missing, turn test_cohort_recentering_not_grand_mean into a real λ-vs-grand-mean regression test, and add SE parity assertions for pure-direction scenarios. Refs: ROADMAP.md:L125-L129, docs/doc-deps.yaml:L261-L275, tests/test_methodology_chaisemartin_dhaultfoeuille.py:L162-L251, tests/test_chaisemartin_dhaultfoeuille_parity.py:L122-L168

Path to Approval

  1. Rework the Theorem 3/4 aggregation to use transition-cell counts rather than n_gt observation counts, or explicitly document and test a weighted estimand as a deviation.
  2. Replace the current first-switch-only SE shortcut with the full Lambda^G_{g,l=1} influence-function implementation from Section 3.7.3, and use that same vector for the bootstrap path.
  3. Move the singleton-baseline filter out of the point-estimate sample unless you also document it as a point-estimate deviation and show parity/hand-worked evidence for the altered estimator.
  4. Commit the missing methodology-review markdown files and strengthen the new tests so they actually fail on grand-mean recentering, wrong λ construction, and SE mismatches versus R on pure-direction designs.

Addresses three methodology issues in the dCDH Phase 1 implementation:

1. Cell-count weighting (P1): the per-period DID and placebo formulas
   were computing N_{a,b,t} as the sum of within-cell observation counts
   instead of the count of (g, t) cells. This silently changed the
   estimand on individual-level inputs with uneven cell sizes. Fixed by
   replacing n_curr[mask].sum() with int(mask.sum()) and using simple
   .mean() over cells instead of n_curr-weighted averages. The fix
   matches the AER 2020 Theorem 3 cell-count notation literally and
   matches R DIDmultiplegtDYN's default behavior on cell-aggregated
   input. Renamed the results field n_switcher_obs -> n_switcher_cells
   for accuracy. Test data uses balanced 1-obs-per-cell panels so the
   regression test_cell_count_weighting_unbalanced_input is needed to
   guard the unbalanced-input path.

2. Full Lambda^G_{g,l=1} influence function (P1): the analytical SE
   path was using a "first-switch only" shortcut where each switching
   group's U^G_g came from its own switch period only. This omitted the
   periods where switching groups serve as stable controls for other
   cohorts. Replaced with _compute_full_per_group_contributions, which
   sums role-weighted outcome differences (joiner +1, stable_0
   -n_10/n_00, leaver -1, stable_1 +n_01/n_11) across all periods. The
   bootstrap path inherits the corrected vectors automatically. SE
   parity vs R DIDmultiplegtDYN on pure-direction scenarios narrowed
   from ~18% to ~3%; pure-direction parity tests now assert SE within
   5% rtol.

3. Singleton-baseline filter at variance stage only (P1): footnote 15
   of the dynamic paper says singleton-baseline groups have no cohort
   peer for the cohort-recentered variance. The previous code applied
   the filter as a point-estimate filter (dropping the groups before
   the per-period DIDs), which under Python's documented period-based
   stable-control interpretation could change DID_M when a unique-
   baseline always-treated or never-treated group would otherwise serve
   as a valid stable control. Fixed by computing the singleton-baseline
   group set in fit() and passing it to the variance helper, which
   excludes those groups from the cohort enumeration and U vector
   only. The cell DataFrame retains them. Updated
   TestSingletonBaselineFilter to assert the variance-only semantics
   and the warning text.
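The role weights described in item 2 can be sketched as follows (toy inputs and a hypothetical helper, not the library's internals): each group's U^G_g accumulates role-weighted outcome first-differences over every period in which the group participates, whether as a switcher or as a stable control.

```python
def u_g(role_by_period, dy_by_period, counts_by_period):
    """Sum role-weighted outcome first-differences for one group.
    counts_by_period[t] = (n_10, n_00, n_01, n_11) at period t.
    Role weights follow the commit text: joiner +1, stable_0 -n_10/n_00,
    leaver -1, stable_1 +n_01/n_11."""
    total = 0.0
    for t, role in role_by_period.items():
        n_10, n_00, n_01, n_11 = counts_by_period[t]
        w = {"joiner": 1.0,
             "stable_0": -n_10 / n_00 if n_00 else 0.0,
             "leaver": -1.0,
             "stable_1": n_01 / n_11 if n_11 else 0.0,
             "none": 0.0}[role]
        total += w * dy_by_period[t]
    return total

# A group that joins at t=1 and then serves as a stable-1 control at t=2.
u = u_g({1: "joiner", 2: "stable_1"},
        {1: 2.0, 2: 0.5},
        {1: (1, 2, 0, 1), 2: (0, 3, 1, 2)})
print(u)  # 2.25 = 1 * 2.0 + (1/2) * 0.5
```

The "first-switch only" shortcut this commit removes corresponds to keeping only the t=1 term above.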

Documentation:
- doc-deps.yaml and ROADMAP.md no longer link to gitignored
  paper review files (REGISTRY.md is the canonical source).
- REGISTRY.md updated with cell-count notation, the full IF formula,
  the variance-only singleton-filter scope, the new SE parity gap
  numbers (3% vs the prior 18%), and a clearer period-vs-cohort
  deviation note.

Tests:
- test_cohort_recentering_not_grand_mean now actually computes both
  formulas on the same data and asserts they differ materially. The
  prior version only checked overall_se > 0.
- test_cell_count_weighting_unbalanced_input pins the cell-count
  contract on a deliberately unbalanced individual-level panel.
- test_parity_joiners_only and test_parity_leavers_only now assert
  SE parity at 5% rtol (was previously skipped).
- 97 dCDH tests passing (96 + the new cell-count regression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0d70f4d0deac473176a1c6fd467e06eb7aa810e4


Overall Assessment

⚠️ Needs changes

Static review only: this shell is missing pytest, pandas, and numpy, so I could not run the new dCDH test suite locally. Assessment is based on two unmitigated P1 findings in methodology-critical paths.

Executive Summary

  • The three prior P1 blockers from the last review appear resolved: Theorem 3 now uses cell counts, the full Lambda^G_{g,l=1} influence-function path is present, and the singleton-baseline filter is now variance-only.
  • P1 [Newly identified]: the analytical variance path still assumes a complete baseline observation and observed adjacent treatment pairs, but the estimator accepts ragged panels; missing periods can be misread as switches or crash cohort-key construction.
  • P1 [Newly identified]: standalone twowayfeweights() bypasses the estimator’s documented validation contract and can silently drop/round malformed treatment/outcome cells.
  • P2: joiner/leaver metadata is mislabeled and partially miscomputed, so summaries and to_dataframe(level="joiners_leavers") can report the wrong sample-size quantities.
  • P2: the parity suite still skips without live R even though committed JSON golden values are already available, which weakens regression coverage for a methodology-critical estimator.

Methodology

Affected methods: dCDH DID_M / DID_+ / DID_- variance path from Web Appendix Section 3.7.3, and the standalone TWFE diagnostic from AER 2020 Theorem 1. Re-review note: the prior findings on cell-count weighting, full IF construction, and variance-only singleton-baseline filtering look addressed in diff_diff/chaisemartin_dhaultfoeuille.py:L511-L554, diff_diff/chaisemartin_dhaultfoeuille.py:L1088-L1096, and diff_diff/chaisemartin_dhaultfoeuille.py:L1237-L1346.

  • Severity: P1 [Newly identified]. Impact: the estimator now tolerates missing (group, time) cells in the point-estimate path, but the cohort/variance path still treats raw NaN entries in D_mat as if they were observed treatment states. D_mat is built with NaN gaps for missing cells, the point-estimate path correctly gates transitions on present, but _compute_cohort_recentered_inputs() ignores that guard and detects first switches from raw D_mat[g, t] != D_mat[g, t-1]; a group missing the first global period can also hit int(np.nan) when cohort keys are built. Step 7 also uses each group’s first observed row, not guaranteed D_{g,1}, for singleton-baseline identification. This can silently misclassify cohorts and distort overall_se / joiners_se / leavers_se, or fail with an opaque error on late-entry groups. Concrete fix: after building N_mat, either explicitly reject groups missing the first global period or drop them with a user-facing warning, and compute first_switch_idx / switch_direction only on adjacent observed pairs (N_mat[:, t] > 0 and N_mat[:, t-1] > 0). Add regression tests for a missing-baseline group and an interior-gap group. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:L524-L528, diff_diff/chaisemartin_dhaultfoeuille.py:L568-L583, diff_diff/chaisemartin_dhaultfoeuille.py:L1078-L1083, diff_diff/chaisemartin_dhaultfoeuille.py:L1443-L1467.

  • Severity: P1 [Newly identified]. Impact: the new public twowayfeweights() helper does not honor the dCDH validation rules documented in the registry and enforced in fit(). fit() explicitly rejects NaN treatment/outcome values, rejects non-binary treatment, and warns before majority-rounding within-cell mixed treatment. twowayfeweights() does none of that; it only coerces treatment numeric, then aggregates with mean()/count() and thresholds d_gt >= 0.5. That means NaN outcomes/treatments can be silently dropped by the aggregation, non-binary treatment can be silently thresholded into {0,1}, and within-cell varying treatment is rounded with no warning, yielding misleading Theorem 1 diagnostics. Concrete fix: factor the validation + cell-aggregation logic into a shared helper used by both fit() and twowayfeweights(), with identical NaN, binary-treatment, and within-cell-rounding behavior, and add tests for each case. Refs: docs/methodology/REGISTRY.md:L470-L473, docs/methodology/REGISTRY.md:L581-L583, diff_diff/chaisemartin_dhaultfoeuille.py:L392-L456, diff_diff/chaisemartin_dhaultfoeuille.py:L1765-L1809.
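The presence-gated switch detection proposed in the first finding can be sketched with numpy (toy arrays and hypothetical names, not the library's code): a NaN in D_mat marks a missing (group, time) cell, and transitions are only counted between adjacent observed periods.

```python
import numpy as np

# Group 0 has an interior gap at t=1; a naive D_mat[g, t] != D_mat[g, t-1]
# comparison across the gap could misread the gap as a switch.
D_mat = np.array([[0, np.nan, 1, 1],
                  [0, 0,      1, 1]], dtype=float)
N_mat = np.where(np.isnan(D_mat), 0, 1)  # per-cell observation counts

def first_switch(g):
    for t in range(1, D_mat.shape[1]):
        # Only compare adjacent periods when BOTH cells are observed.
        if N_mat[g, t] > 0 and N_mat[g, t - 1] > 0 and D_mat[g, t] != D_mat[g, t - 1]:
            return t
    return None

print(first_switch(0), first_switch(1))  # None 2: the gap blocks detection for group 0
```

Whether a gapped group is then dropped with a warning or rejected outright is the policy choice the fix commit settles.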

Code Quality

Performance

  • No material findings in the changed performance-sensitive paths.

Maintainability

  • Severity: P3. Impact: summary() still reports never-switching groups under “Groups dropped before estimation”, but the methodology registry now says n_groups_dropped_never_switching is backward-compat metadata and no longer represents an actual exclusion. That makes the results summary internally inconsistent with the documented estimator contract. Concrete fix: rename that block to something like “Group filter / metadata counts” and label the never-switching count as “reported, not dropped”. Refs: diff_diff/chaisemartin_dhaultfoeuille_results.py:L412-L429, docs/methodology/REGISTRY.md:L476-L476.

Tech Debt

  • No separate tech-debt finding, but TODO.md does not currently track either P1 above, so they are not mitigated under the project’s review rules. Ref: TODO.md:L51-L98.

Security

  • No material security findings in the changed files.

Documentation/Tests

  • Severity: P2. Impact: the parity tests already load committed JSON golden values, but every scenario still takes require_r_dcdh, so the whole regression suite skips when live R / DIDmultiplegtDYN is unavailable. That leaves the main methodology-regression checks inactive on ordinary Python-only environments even though the assertions themselves do not need R once the goldens are committed. Concrete fix: keep the R availability check only for the golden-value generation workflow, and let the JSON-backed parity assertions run whenever benchmarks/data/dcdh_dynr_golden_values.json exists. Refs: tests/test_chaisemartin_dhaultfoeuille_parity.py:L40-L58, tests/test_chaisemartin_dhaultfoeuille_parity.py:L122-L184, tests/conftest.py:L107-L178.

  • Severity: P3. Impact: the docs are internally inconsistent about the dCDH paper-review markdowns. ROADMAP.md now says the primary papers are not committed in-repo, while CHANGELOG.md says three dCDH paper reviews were committed under docs/methodology/papers/. Concrete fix: align the changelog with the roadmap, or add the promised review files. Refs: ROADMAP.md:L127-L129, CHANGELOG.md:L22-L22.

Path to Approval

  1. Make the Section 3.7.3 variance/cohort path robust to ragged panels: either reject groups missing the baseline period with a clear ValueError, or base baseline/switch detection on observed adjacent pairs from N_mat instead of raw NaN comparisons. Add regression tests for a late-entry group and for an interior missing period.
  2. Make twowayfeweights() reuse the same validation and cell-aggregation contract as fit(): reject NaN treatment/outcome, reject non-binary treatment before aggregation, and emit the same warning on within-cell mixed treatment. Add one test for each case.

Addresses two new P1s, two P2s, and two P3s from CI re-review:

P1: ragged panel handling
- Add fit() Step 5a validation: reject groups missing the first global
  period with a clear ValueError listing offenders, and drop groups with
  interior period gaps with an explicit UserWarning. Eliminates the
  silent NaN propagation in _compute_cohort_recentered_inputs that
  could crash the cohort enumeration with int(np.nan) or misclassify
  late-entry groups' baselines.
- Fix singleton-baseline computation in fit() Step 7 to read the
  validated first global period explicitly instead of using
  groupby.first() which returned the first OBSERVED row per group.
- Add defensive presence-gating in _compute_cohort_recentered_inputs:
  the helper now refuses to run if N_mat[:, 0] has any zero entries
  (a fit() validation regression tripwire), and the first-switch
  detection loop only counts transitions between adjacent OBSERVED
  periods. Even with fit() validation in place, the helper is now
  safe to call directly.
- Add 2 regression tests: test_missing_baseline_period_raises_value_error
  and test_interior_gap_drops_group_with_warning.

P1: twowayfeweights validation refactor
- Extract _validate_and_aggregate_to_cells helper enforcing the dCDH
  validation contract (NaN treatment / NaN outcome / non-binary
  treatment all raise ValueError; within-cell varying treatment
  emits UserWarning before majority rounding).
- Both fit() and twowayfeweights() now call the helper. Single source
  of truth for the validation rules, no drift between the two public
  entry points.
- Add 4 regression tests for twowayfeweights() validation:
  test_twowayfeweights_rejects_nan_treatment,
  test_twowayfeweights_rejects_nan_outcome,
  test_twowayfeweights_rejects_non_binary_treatment,
  test_twowayfeweights_warns_on_within_cell_rounding.

P2: joiner/leaver metadata
- Fix n_joiner_cells = int(n_10_t_arr.sum()) (was count_nonzero counting
  PERIODS, not cells). Same for n_leaver_cells.
- Compute n_joiner_obs and n_leaver_obs as actual observation counts
  (sum of n_gt over the joiner/leaver cells across periods), not as
  cell totals. For balanced one-obs-per-cell panels they equal
  n_*_cells; for individual-level inputs with multiple obs per cell
  they can be larger. Update results dataclass docstrings.

P2: parity tests run on JSON without R
- Decouple golden_values fixture from require_r_dcdh. Tests now run
  whenever the JSON file exists. R is only needed to regenerate the
  JSON via benchmarks/R/generate_dcdh_dynr_test_values.R.
- Verified by running DIFF_DIFF_R=skip pytest tests/test_chaisemartin_dhaultfoeuille_parity.py
  — all 5 parity scenarios PASS (was previously skipping entirely).

P3: summary() label rename
- Rename "Groups dropped before estimation:" to "Group filter /
  metadata counts:". Label never-switching as "(reported, not dropped)".
  Reflects the Round 2 change where never-switching groups participate
  in the variance via stable-control roles.

P3: CHANGELOG/ROADMAP consistency
- Remove the CHANGELOG bullet that claimed three paper review files
  were committed under docs/methodology/papers/. Replace with a
  reference to REGISTRY.md as the canonical methodology surface
  (matching the ROADMAP wording).

Tests: 103 dCDH passing (97 + 6 new). Worked example DID_M = 2.5 still
exact. Pure-direction R parity tests still pass at 1e-4 / 5% rtol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: df7a0a389e78d3b82eeb3455948a9adc8d74ca12


Overall Assessment

✅ Looks good

Static review only: I could not run the new dCDH suite in this environment because pytest, numpy, and pandas are not installed.

Executive Summary

Methodology

Methodology cross-check: the PR’s Phase 1 framing is consistent with the cited source material and with the registry’s current-scope definition for DID_M / DID_1; the previously flagged methodology blockers noted above also appear resolved in the implementation. (aeaweb.org)

Code Quality

  • Severity: P2. Impact: to_dataframe(level="joiners_leavers") exposes one n_obs column with different units by row: DID_M uses switcher-cell count, while DID_+ / DID_- use raw observation counts summed over switching cells. That makes the table easy to misread programmatically and breaks the semantic contract of a single count field. Concrete fix: split this into separate n_cells and n_obs columns, or rename the current field so all rows use the same unit consistently. Refs: diff_diff/chaisemartin_dhaultfoeuille_results.py:L645, diff_diff/chaisemartin_dhaultfoeuille.py:L796.

Performance

  • No material findings in the changed performance-sensitive paths.

Maintainability

  • No material maintainability finding beyond the documentation/contract drift noted below.

Tech Debt

  • No separate tech-debt blocker in this diff. The remaining issues are low-severity contract/documentation cleanup, not silent correctness bugs.

Security

  • No material security findings in the changed files. I did not see secrets, unsafe subprocess additions, or new network-facing behavior.

Documentation/Tests

  • Severity: P3. Impact: the results-object docs are stale relative to the implemented Phase 1 contract. DCDHBootstrapResults still says placebo bootstrap outputs are populated when available, but Phase 1 intentionally never surfaces placebo bootstrap inference; n_groups_dropped_never_switching still says never-switchers are excluded from variance, while the registry/code now treat that field as backward-compat metadata only. Concrete fix: update those docstrings to match current runtime behavior. Refs: diff_diff/chaisemartin_dhaultfoeuille_results.py:L45, diff_diff/chaisemartin_dhaultfoeuille_results.py:L237, docs/methodology/REGISTRY.md:L476, docs/methodology/REGISTRY.md:L554, diff_diff/chaisemartin_dhaultfoeuille.py:L952.
  • Severity: P3. Impact: a couple of new comments overstate what the fixtures/docs guarantee. The R golden-value script says set.seed() and NumPy default_rng() produce identical draws, and the reversible DGP doc says Assumption 5 is “automatically satisfied” for binary treatment even though the PR explicitly relies on dropping multi-switch groups for those paths. These are comment-level issues, but they can mislead future maintainers about parity and identification assumptions. Concrete fix: reword both comments to describe structural mirroring rather than RNG identity, and to say the default single-switch path is A5-safe while multi-switch patterns are stress tests for the drop filter. Refs: benchmarks/R/generate_dcdh_dynr_test_values.R:L37, diff_diff/prep_dgp.py:L1877.

…trings)

P2: split joiners_leavers DataFrame into n_cells + n_obs columns
- to_dataframe(level="joiners_leavers") previously had a single n_obs
  column with mixed semantics by row (DID_M used switcher cell count;
  DID_+/DID_- used raw observation counts). Two columns with consistent
  units across all rows: n_cells (count of switching (g, t) cells) and
  n_obs (sum of n_gt over the same cells). DID_M row uses union of
  joiner + leaver cells. Updated test_to_dataframe_joiners_leavers to
  pin the new contract.
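
The two-column contract can be illustrated with a toy switcher-cell table. Column names and the DataFrame shape here are assumptions for illustration; only the semantics (n_cells = count of switching cells, n_obs = sum of n_gt over those cells, DID_M row = union of joiner and leaver cells) come from the commit message.

```python
import pandas as pd

# Toy table: one row per switching (group, period) cell,
# n_gt = observations in that cell.
cells = pd.DataFrame({
    "direction": ["joiner", "joiner", "leaver"],
    "n_gt": [3, 2, 5],
})

rows = []
for label, mask in [
    ("DID_+", cells["direction"] == "joiner"),
    ("DID_-", cells["direction"] == "leaver"),
    ("DID_M", cells["direction"].notna()),  # union of joiner + leaver cells
]:
    sub = cells[mask]
    rows.append({
        "effect": label,
        "n_cells": len(sub),                 # count of switching cells
        "n_obs": int(sub["n_gt"].sum()),     # observations over those cells
    })

table = pd.DataFrame(rows)
print(table)
```

With one observation per cell the two columns coincide; with individual-level inputs n_obs can exceed n_cells, which is exactly the ambiguity the split resolves.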

P3: stale docstrings on results object
- DCDHBootstrapResults class docstring now states explicitly that
  placebo bootstrap fields ALWAYS remain None in Phase 1 (the previous
  wording said they were "populated when available"). Per-field
  docstrings for placebo_se / placebo_ci / placebo_p_value now point
  back to the class-level note.
- n_groups_dropped_never_switching docstring now reflects the Round 2
  full-IF fix: never-switching groups participate in the variance via
  stable-control roles and the field is reported for backwards
  compatibility only — no actual exclusion happens.
- n_groups_dropped_singleton_baseline docstring clarifies the
  variance-only filter scope (cell DataFrame retains them as
  period-based stable controls).

P3: misleading R-script + prep_dgp comments
- benchmarks/R/generate_dcdh_dynr_test_values.R: clarified that the
  Python and R generators mirror each other STRUCTURALLY (same pattern
  logic, same FE/effect/noise model), not at the RNG level. R's
  set.seed and NumPy's default_rng use different RNGs. Parity tests
  load the R script's golden-value JSON so both sides operate on
  byte-identical input regardless of how it was originally generated.
- prep_dgp.py generate_reversible_did_data: clarified that the default
  single_switch pattern is A5-safe by construction (every group has at
  most one transition). Other patterns (random/cycles/marketing) ARE
  allowed to violate A5 and exist primarily as stress tests for the
  drop_larger_lower=True filter. The cohort-recentered variance
  formula is derived under A5, which is why drop_larger_lower defaults
  to True.

Tests: 103 dCDH passing (no new tests; the existing
test_to_dataframe_joiners_leavers was strengthened to assert the new
n_cells / n_obs contract).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 787eae20a2aab0a52701ab1d6f3ea88ed449e895


Overall Assessment

Needs changes

Static review only: pytest, numpy, and pandas are unavailable in this environment, so I could not execute the new dCDH suite.

Executive Summary

  • The prior re-review items I spot-checked look addressed in the current diff: shared validation in twowayfeweights(), the joiners_leavers count contract, and the previously stale dCDH docstrings/comments.
  • P1 [Newly identified]: the analytical variance path can collapse to overall_se = 0.0 on valid panels where every exact (D_{g,1}, F_g, S_g) cohort is a singleton. The PR’s own 4-group worked example appears to hit that case, while the committed R golden file records overall_se = 1.8027756377.
  • P1 [Newly identified]: the placebo path silently applies A11 zero-retention when placebo switchers lack eligible stable controls, but it does not expose the warning/flags that the main estimator exposes.
  • The documented period-vs-cohort control-set deviation in REGISTRY.md does not mitigate the SE issue; the registry explicitly treats the 4-group worked example as a case where that deviation should not matter.
  • The new tests pin ATT parity on the worked example, but they still do not pin its SE or the placebo A11 warning path, which is why the above issues can escape.

Methodology

Code Quality

  • No separate code-quality finding beyond the methodology defects above.

Performance

  • No material performance findings in the changed paths reviewed.

Maintainability

  • No separate maintainability finding beyond the methodology/test contract issues below.

Tech Debt

  • Severity: P3 informational. Impact: I did not find any TODO.md entry that tracks either methodology issue above, so they remain unmitigated under the project’s deferment rules and still count toward the assessment. Concrete fix: resolve these in code in this PR; only any leftover non-blocking cleanup should be deferred into TODO.md afterward. Refs: TODO.md:L51-L90

Security

  • No material security findings in the changed files reviewed. I did not see secrets, unsafe subprocess additions, or new network-facing behavior.

Documentation/Tests

Path to Approval

  1. Fix the analytical SE path so valid panels with singleton exact (D_{g,1}, F_g, S_g) cohorts do not silently collapse to overall_se = 0.0; if the variance is genuinely unidentified after the intended paper-consistent filtering, surface NaN plus an explicit warning instead.
  2. Make placebo A11 handling explicit by either adding placebo A11 flags plus a consolidated warning or by marking the placebo unavailable when the required placebo control sets are missing.
  3. Add regression coverage for both cases: assert the worked-example overall_se/CI against the committed golden JSON, and add a placebo panel that exercises missing placebo controls on one side.

Addresses two new P1s + a P2 from CI re-review:

P1: degenerate-cohort SE collapses to 0.0
- _plugin_se now returns NaN when sum(U_centered**2) <= 0 (in addition
  to the existing n==0 and divisor<=0 guards). This catches the
  case where every variance-eligible group forms its own
  (D_{g,1}, F_g, S_g) cohort, so cohort recentering produces an
  identically-zero centered IF vector. Returning NaN rather than 0.0
  avoids the failure mode where a zero SE silently implies infinite
  precision.
- fit() now detects this case after the analytical SE call and emits
  a UserWarning explaining that the variance is unidentified, the
  point estimate is still valid, and the bootstrap path inherits the
  same degeneracy on the same data.
- Worked example test now asserts overall_se is NaN with the warning,
  alongside the existing DID_M = 2.5 exact assertion.
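
The guard chain can be sketched as follows. The signature and names are hypothetical; only the guard logic (n == 0, non-positive divisor, and the new sum-of-squares check all return NaN rather than 0.0) reflects the commit message.

```python
import numpy as np

def plugin_se(u_centered: np.ndarray, divisor: float) -> float:
    """Sketch of the degenerate-cohort SE guard (hypothetical signature)."""
    n = u_centered.size
    if n == 0 or divisor <= 0:
        return float("nan")
    ssq = float(np.sum(u_centered ** 2))
    if ssq <= 0.0:
        # Every variance-eligible group is its own (D_{g,1}, F_g, S_g)
        # cohort -> cohort recentering zeroes the whole IF vector.
        return float("nan")
    return float(np.sqrt(ssq)) / divisor

print(plugin_se(np.zeros(4), 2.0))  # nan
```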

P1: placebo A11 silent zeroing
- _compute_placebo now mirrors the main path's A11 distinction:
  "no joiners" stays a natural zero (no flag), while "joiners but no
  3-period stable_0 controls" sets the placebo period contribution
  to zero AND appends a warning string to placebo_a11_warnings.
  Symmetric for leavers/stable_1.
- _compute_placebo's return signature widened from
  Optional[Tuple[float, bool]] to Optional[Tuple[float, bool, List[str]]]
  to expose the A11 warnings to the caller.
- fit() unpacks the new tuple and surfaces a consolidated
  "Placebo (DID_M^pl) Assumption 11 violations" warning when any
  placebo period was zeroed by the A11 contract. Mirrors the main
  DID path's warning format.
- New test_placebo_a11_violation_emits_warning constructs a panel
  where placebo joiners exist but no placebo stable_0 controls do
  (4-group T=3 with 2 joiners + 2 always-treated controls), asserts
  the placebo A11 warning fires.
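
The A11 distinction the placebo path now mirrors can be boiled down to a toy decision function. This is not the library's `_compute_placebo`; it only illustrates the contract described above (no joiners = natural zero with no flag; joiners without stable_0 controls = zeroed contribution plus a warning string).

```python
from typing import List, Tuple

def placebo_contribution(n_joiners: int, n_stable_0: int) -> Tuple[float, List[str]]:
    """Toy illustration of A11 zero-retention (hypothetical names)."""
    if n_joiners == 0:
        return 0.0, []  # natural zero, no flag
    if n_stable_0 == 0:
        # A11 violation: contribution zeroed AND flagged for the caller.
        return 0.0, ["A11: placebo joiners present but no stable_0 controls"]
    return 1.23, []  # placeholder DID value when both sides exist

print(placebo_contribution(0, 0))  # (0.0, [])
print(placebo_contribution(2, 0))
```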

REGISTRY.md: two new Note blocks
- The degenerate-cohort SE contract (NaN + warning, deviation from R)
- The placebo A11 zero-retention contract (mirrors main DID path)

Tests: 105 dCDH passing (was 103, added 2 new). The existing
test_hand_calculable_4group_3period_joiners_and_leavers wraps fit()
in a warnings.catch_warnings() block with simplefilter("ignore") since the
degenerate-cohort warning now fires on this panel; the dedicated
test_worked_example_se_is_unidentified_with_warning asserts the
warning explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1e05c9d86fb2cb12451f888fc0e123c3064f2594


Overall Assessment

⚠️ Needs changes

Static review only. I could not execute the new dCDH suite here because pytest, numpy, and pandas are unavailable in this environment.

Executive Summary

  • The prior re-review blockers I spot-checked look addressed: degenerate singleton-cohort analytical SE now surfaces NaN plus a warning instead of 0.0, and placebo A11 zero-retention now emits a consolidated warning, with both behaviors documented and tested.
  • P1 [Newly identified]: ChaisemartinDHaultfoeuille.fit() now errors on groups missing the first global period and drops groups with interior period gaps before computing DID_M, DID_+, DID_-, or DID_M^pl, but the dCDH registry/docs do not document that sample restriction even though the PR also presents DIDmultiplegtDYN as the reference implementation and that manual explicitly supports unbalanced panels. diff_diff/chaisemartin_dhaultfoeuille.py:L534-L607, docs/methodology/REGISTRY.md:L468-L478, docs/methodology/REGISTRY.md:L544-L569, docs/api/chaisemartin_dhaultfoeuille.rst:L18-L43, tests/test_chaisemartin_dhaultfoeuille.py:L462-L507.
  • The documented Phase 1 placebo-SE=NaN contract and the documented degenerate-cohort overall_se=NaN behavior are not blockers.
  • P3: several changed docs/comments still describe never-switching groups as filtered from variance or as true “filter counts,” while the implementation and main registry note now say they are retained and that field is metadata only.

Methodology

Code Quality

  • No separate findings beyond the methodology item above.

Performance

  • No material performance findings in the changed paths reviewed.

Maintainability

  • No separate maintainability findings beyond the documentation drift noted below.

Tech Debt

  • Severity: P3 informational. Impact: I did not find a TODO.md entry tracking the P1 above, so it remains unmitigated under the project’s deferment rules. Concrete fix: resolve or document it in this PR rather than deferring it. TODO.md:L51-L95

Security

  • No material security findings in the changed files reviewed.

Documentation/Tests

Path to Approval

  1. Decide the intended dCDH unbalanced/ragged-panel contract and document it in docs/methodology/REGISTRY.md: if Phase 1 requires first-period observation plus consecutive periods, say so explicitly and label it as a deviation/limitation relative to DIDmultiplegtDYN; otherwise, implement the reference package’s unbalanced-panel handling.
  2. Mirror that contract in the public API/README text that currently advertises DIDmultiplegtDYN parity, so users do not expect R-like behavior on late-entry or interior-gap panels.
  3. Add one regression test that pins the remaining unspoken branch of the contract, e.g. whether terminal missingness / late exit is retained or rejected, so the documented behavior is fully specified.

Documents the dCDH Phase 1 ragged-panel handling contract that was
implemented in Round 3 but not surfaced in any user-facing doc:

- Balanced baseline required: groups missing the first global period
  raise ValueError with offending IDs
- Interior period gaps: groups dropped with UserWarning
- Terminal missingness (early exit / right-censoring): retained,
  contributes from observed periods only via the per-period present
  guard at three sites in the variance computation
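
The three-way contract above can be sketched as a stand-alone panel check. This is an illustrative approximation under assumed column names (`group`, `period`), not the library's validation code; the branch behaviors (ValueError, warn-and-drop, silent retention) are taken from the bullets.

```python
import warnings
import pandas as pd

def check_panel(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the Phase 1 ragged-panel contract (hypothetical names)."""
    first = df["period"].min()
    per_group = df.groupby("group")["period"].agg(["min", "max", "nunique"])
    # 1. Balanced baseline required: error with offending IDs.
    missing_baseline = per_group.index[per_group["min"] > first].tolist()
    if missing_baseline:
        raise ValueError(f"groups missing first period {first}: {missing_baseline}")
    # 2. Interior gaps: drop with a warning.
    span = per_group["max"] - per_group["min"] + 1
    gapped = per_group.index[per_group["nunique"] < span].tolist()
    if gapped:
        warnings.warn(f"dropping groups with interior period gaps: {gapped}",
                      UserWarning)
        df = df[~df["group"].isin(gapped)]
    # 3. Terminal missingness (early exit): retained, contributes from
    #    observed periods only.
    return df
```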

This is a Phase 1 limitation relative to R DIDmultiplegtDYN, which
supports unbalanced panels with documented missing-treatment-before-
first-switch handling. Added a new **Note (deviation from R)** block
in REGISTRY.md, mirrored in the API rst, README, and CHANGELOG so
users do not expect R-like behavior on late-entry or interior-gap
panels.

Also fixes doc drift on never-switching groups: after the Round 2
full-IF fix they participate in the variance via stable-control roles
and are no longer filtered, but CHANGELOG/README/REGISTRY checklist
and a methodology test comment still described them as filtered. The
n_groups_dropped_never_switching field stays as backwards-compat
metadata only (already documented correctly in the dataclass docstring
since Round 2).

Adds test_terminal_missingness_retained as a regression test pinning
the contract: a group with periods [0, 1, 2] in a 5-period panel is
retained in results.groups, the fit completes without error, and
the per-period DIDs are populated.

Files modified:
- docs/methodology/REGISTRY.md (assumption check + new deviation Note + checklist)
- docs/api/chaisemartin_dhaultfoeuille.rst (Phase 1 panel requirements section)
- README.md (Note block + n_groups_dropped_never_switching field row split)
- CHANGELOG.md (panel requirement sub-bullet + line 17 never-switching fix)
- tests/test_chaisemartin_dhaultfoeuille.py (test_terminal_missingness_retained)
- tests/test_methodology_chaisemartin_dhaultfoeuille.py (fix test comment)

Test count: 105 -> 106 (one new regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cf639d44a87b2f2c322284bd3a289578c473b546


Overall Assessment

⚠️ Needs changes

Static review only. I could not execute the added dCDH tests in this runner because pytest and numpy are not installed here, although the project declares them and CI installs them in normal jobs. pyproject.toml:L48-L57, .github/workflows/rust-test.yml:L191-L199

Executive Summary

Methodology

The previous ragged-panel methodology blocker appears resolved; I do not see a remaining finding there. docs/methodology/REGISTRY.md:L477-L477, docs/api/chaisemartin_dhaultfoeuille.rst:L52-L72

Code Quality

No additional findings beyond the public cluster no-op. The changed dCDH inference path consistently routes through safe_inference(). diff_diff/chaisemartin_dhaultfoeuille.py:L920-L942, diff_diff/chaisemartin_dhaultfoeuille.py:L1023-L1038

Performance

No material findings in the changed paths I reviewed.

Maintainability

Tech Debt

  • Severity: P3. Impact: I did not find a dCDH entry in TODO.md that would mitigate the cluster issue, so it remains unmitigated under the project’s deferment rules. Concrete fix: resolve or explicitly document/gate the behavior in this PR rather than deferring it. TODO.md:L51-L97

Security

No material security findings in the changed files reviewed.

Documentation/Tests

Path to Approval

  1. Choose a Phase 1 cluster contract and make it explicit in code: either reject any non-None cluster with a clear error, or implement cluster-aware analytical/bootstrap inference end-to-end.
  2. If cluster is intentionally forward-compatible only, add an explicit dCDH **Note:** in docs/methodology/REGISTRY.md and mirror that contract in the public docs so the behavior is not silent.
  3. Add a regression test that pins the chosen cluster behavior, e.g. ChaisemartinDHaultfoeuille(cluster="state").fit(...) should either raise a clear Phase 1 message or exercise documented cluster-aware inference.

…nguage

Resolves three findings from the prior CI review:

P1 — cluster parameter was a public no-op. The constructor exposed
`cluster: Optional[str] = None` and stored it on `self.cluster`, but
neither `fit()` nor `_compute_dcdh_bootstrap()` ever read it, so
`cluster="state"` silently produced the same group-level inference as
`cluster=None`. The Phase 1 contract is now: dCDH always clusters at
the group level (via the cohort-recentered influence function and the
multiplier bootstrap), and any non-None cluster value raises
`NotImplementedError` at construction time. The same gate fires from
`set_params`. Added `test_cluster_parameter_raises_not_implemented`
covering all four entry points (`__init__`, `set_params`, the
`cluster=None` happy path, and the convenience function).
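
A minimal sketch of the construction-time gate, under the assumption that the same check simply fires from both entry points (the class below is a toy, not the library's estimator):

```python
from typing import Optional

class ChaisemartinDHaultfoeuilleSketch:
    """Toy illustration of the Phase 1 cluster contract: inference is
    always clustered at the group level, so any non-None cluster is
    rejected at construction time and from set_params."""

    def __init__(self, cluster: Optional[str] = None):
        if cluster is not None:
            raise NotImplementedError(
                "Phase 1 always clusters at the group level; cluster must be None"
            )
        self.cluster = cluster

    def set_params(self, **params):
        # Same gate fires on the setter path.
        if params.get("cluster") is not None:
            raise NotImplementedError(
                "Phase 1 always clusters at the group level; cluster must be None"
            )
        for name, value in params.items():
            setattr(self, name, value)
        return self
```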

P3 — TWFE diagnostic was running on the post-Step-5a sample but the
inline comment claimed "FULL pre-filter cell dataset" and the
standalone `twowayfeweights()` function used the truly pre-filter
cell. On ragged panels with interior gaps, the two paths diverged.
Fixed by reordering: the TWFE diagnostic now runs as Step 5a (was
5b) BEFORE the ragged-panel validation, which is now Step 5b (was
5a). The blocks are independent — the diagnostic just reads `cell`
while the ragged-panel block mutates it. Both API surfaces now
operate on the same `_validate_and_aggregate_to_cells()` output.

P3 — API rst step 3 said "Filters singleton-baseline groups" which
read as a point-estimate sample drop. After Round 3, singleton-
baseline groups are excluded from the variance computation only
(they remain in the point-estimate sample as period-based stable
controls). Fixed the language to match.

Documentation:
- REGISTRY.md gets a new `**Note (Phase 1 cluster contract):**`
  block in the dCDH section explaining the always-group-level
  semantics and the construction-time NotImplementedError gate.
- API rst step 3 reflects the variance-only singleton-baseline scope.
- The dCDH `cluster` parameter docstring now describes the Phase 1
  contract instead of "Currently unused — analytical SEs are always
  at the group level via the cohort-recentered plug-in."

Test counts: 106 -> 107 (one new cluster gate regression test).

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py (cluster gate in __init__
  and set_params, docstring update, TWFE diagnostic block reorder
  with renumbered Step 5a/5b labels)
- docs/methodology/REGISTRY.md (cluster contract Note + rename
  Step 5a -> Step 5b in the ragged-panel deviation Note)
- docs/api/chaisemartin_dhaultfoeuille.rst (singleton-baseline
  language fix in step 3 of the overview)
- tests/test_chaisemartin_dhaultfoeuille.py
  (test_cluster_parameter_raises_not_implemented in
  TestForwardCompatGates; rename Step 5a -> Step 5b in three
  ragged-panel test docstrings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 83cc093c82faeaac67b9dadceeefbc3dfe10b5e9


Overall Assessment

✅ Looks good

This re-review clears the prior P1 blocker: cluster is no longer a silent no-op in Phase 1, and I did not find a new unmitigated P0/P1 in the changed dCDH estimator or its supporting files. Static review only; I could not execute the added tests here because pytest is not installed in this runner.

Executive Summary

Methodology

Affected methods: Phase 1 dCDH DID_M, DID_+, DID_-, DID_M^pl, plus the standalone Theorem 1 twowayfeweights() diagnostic.

No unmitigated P0/P1 findings. The changed implementation matches the registry on the core Theorem 3 / Theorem 4 formulas, the Section 3.7.3 cohort-recentered variance, and the explicitly documented deviations from R, so I am not treating those documented differences as defects (docs/methodology/REGISTRY.md:L480, docs/methodology/REGISTRY.md:L517, docs/methodology/REGISTRY.md:L565, docs/methodology/REGISTRY.md:L569, diff_diff/chaisemartin_dhaultfoeuille.py:L785, diff_diff/chaisemartin_dhaultfoeuille.py:L1506).

Code Quality

No material findings. The changed dCDH inference path uses safe_inference() rather than inline t-stat / p-value / CI logic, so I did not find new partial-NaN or inline-inference anti-patterns (diff_diff/chaisemartin_dhaultfoeuille.py:L950, diff_diff/utils.py:L152).

Performance

No material findings from static review. I did not benchmark in this runner.

Maintainability

Tech Debt

  • Severity: P3 (tracked in TODO.md). Impact: the broader gap that allowed the .txt AI guide to drift is already tracked: .txt documentation is outside current smoke validation, so this does not block the PR under the repo’s deferment rules (TODO.md:L93, docs/llms-full.txt:L238).
    Concrete fix: no blocker for this PR; follow the existing TODO by extending doc-snippet or consistency checks to docs/*.txt.

Security

No security findings in the changed files reviewed.

Documentation/Tests

Static only here: I could not run the added tests because pytest is unavailable in this runner. The changed suite itself does cover the key edge cases from the checklist, including missing baseline, interior gaps, terminal missingness, A11 zero-retention, placebo bootstrap deferral, and R golden-value parity (tests/test_chaisemartin_dhaultfoeuille.py:L509, tests/test_chaisemartin_dhaultfoeuille.py:L528, tests/test_chaisemartin_dhaultfoeuille.py:L556, tests/test_chaisemartin_dhaultfoeuille.py:L690, tests/test_chaisemartin_dhaultfoeuille.py:L922, tests/test_chaisemartin_dhaultfoeuille_parity.py:L123).

Two P3 doc-drift cleanups from the prior CI review.

P3 — Bootstrap helper docstring drift. The dCDH bootstrap mixin
documented the re-aggregation divisor as a "count of contributing
groups" with the formula `W @ u_centered / n_groups_target`, but
the live code (and the registry) use switching-cell denominators
(`N_S`, joiner cells, leaver cells) per Theorem 3 of AER 2020. The
numbers were correct; only the docstring drifted. Renamed the public
mixin parameter from `n_groups_overall` to `divisor_overall` and
updated the docstring to describe the divisor as the switching-cell
count from the Theorem 3 weighting formula. The joiners/leavers
tuple description in the docstring also says `(u_centered, divisor,
original_effect)` instead of `(u_centered, n_groups, original_effect)`
to match. The fit() call site at line 1041 passes `divisor_overall=N_S`
under the new name.
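
One multiplier-bootstrap draw under the renamed contract can be sketched as below. The vector sizes and values are toy data; only the shape of the re-aggregation (`W @ u_centered / divisor` with a switching-cell divisor such as N_S, not a group count) follows the docstring fix.

```python
import numpy as np

rng = np.random.default_rng(0)
u_centered = rng.normal(size=6)  # toy centered per-group IF contributions
N_S = 4.0                        # switching-cell count (the Theorem 3 divisor)

# One Rademacher multiplier draw: perturb group contributions, then
# re-aggregate with the switching-cell divisor.
W = rng.choice([-1.0, 1.0], size=u_centered.size)
draw = W @ u_centered / N_S
```

Dividing by a "count of contributing groups" here would rescale every draw, which is why the parameter rename matters even though the live numbers were already correct.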

P3 — `docs/llms-full.txt` line 238 said the dCDH `cluster` parameter
was "Reserved for future use," which is stale after Round 7's
NotImplementedError gate. Updated to "Phase 1: must be None; non-None
raises NotImplementedError" to match REGISTRY.md and the constructor
contract.

No behavior change; both fixes are purely textual / parameter
naming. 107 tests still pass; black, ruff clean.

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py
  (rename n_groups_overall -> divisor_overall, expanded docstring
  describing the cell-count divisor and the formula `W @ u_centered
  / divisor`)
- diff_diff/chaisemartin_dhaultfoeuille.py
  (call site at line 1041 updates the keyword argument to match)
- docs/llms-full.txt
  (cluster comment matches the Phase 1 contract)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8cafae13ad3d7d1516c7a198b9331730dd5414fb


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • Previous blocker resolved: cluster is now rejected in both __init__ and set_params, matching the Phase 1 contract.
  • Affected methodology here is the dCDH DID_M / placebo surface plus the AER 2020 TWFE decomposition diagnostic, with analytical variance mapped to the intertemporal DiD working paper. (aeaweb.org)
  • Severity P1 [Newly identified]: fitted twfe_* outputs are computed on the pre-filter cell sample, while overall_att, results.groups, and inference use the post-filter sample. When interior-gap or multi-switch filters fire, the diagnostic is not actually from the “same fit” / “same data” as documented.
  • That deviation is not documented in docs/methodology/REGISTRY.md and is not tracked in TODO.md, so it remains unmitigated.
  • Static review only: this runner does not have numpy, pandas, or pytest, so I could not execute the added tests.

Methodology

Affected methodology: the dCDH DID_M / placebo estimator family and the TWFE decomposition diagnostic introduced by de Chaisemartin and D’Haultfoeuille, plus the dynamic-paper variance mapping. (aeaweb.org)

Beyond that sample-contract issue, I did not find another unmitigated methodology defect in the changed DID_M, placebo, or cohort-recentered inference paths.

Code Quality

No findings. I did not see a new inline-inference or partial-NaN regression; the changed estimator path routes inference through diff_diff/utils.py:L152.

Performance

No separate findings from static review. The blocker above also does unnecessary TWFE work on rows later dropped, but the correctness/contract problem is the part that needs action.

Maintainability

No separate findings.

Tech Debt

No separate deferrable finding. The P1 above is not tracked under the TODO.md “Tech Debt from Code Reviews” section, so it is not mitigated by project deferment rules.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Make the fitted twfe_* results use the same filtered cell sample as overall_att and results.groups, or explicitly split pre-filter vs post-filter diagnostics and warn when they differ.
  2. Add regression coverage for both filter paths that change the estimation sample: one interior-gap drop case and one drop_larger_lower=True multi-switch-drop case.
  3. Update the dCDH docstring, results docstring, API docs, and estimator-selection docs so their “same fit” / “same data” wording matches the implemented contract.

Documents the dCDH TWFE diagnostic sample contract that Round 7's
swap left implicit. The fitted results.twfe_* values are computed on
the FULL pre-filter cell sample (matching the standalone
twowayfeweights() function), NOT on the post-filter estimation
sample used by overall_att / results.groups / inference fields. The
existing user-facing wording said "TWFE on the same data" /
"diagnostic from the same fit" — phrases that naturally read as
"same data as overall_att" — which contradicted the post-Round-7
behavior. This commit:

1. Adds a new `**Note (TWFE diagnostic sample contract):**` block in
   REGISTRY.md enumerating all three sample-shaping filters
   (interior-gap, multi-switch, singleton-baseline) and explicitly
   carving singleton-baseline as variance-only (no fitted-vs-overall_att
   mismatch, so no warning).

2. Rewrites the `twfe_diagnostic` parameter docstring in
   chaisemartin_dhaultfoeuille.py to describe the pre-filter contract
   and the divergence warning.

3. Rewrites the twfe_weights / twfe_fraction_negative / twfe_sigma_fe
   / twfe_beta_fe field docstrings in the results dataclass to clarify
   they describe the FULL pre-filter cell sample, with a pointer to
   the REGISTRY contract Note.

4. Adds a `UserWarning` from `fit()` whenever the user requested the
   TWFE diagnostic AND any of the interior-gap or multi-switch filters
   dropped groups. The warning explains the divergence with explicit
   counts and points at REGISTRY for the rationale. The warning fires
   regardless of whether the diagnostic itself succeeded or hit the
   rank-deficient fallback (the plan-review correctly flagged that the
   `twfe_diagnostic_payload is not None` guard would swallow the rare
   rank-deficient + filtered-panel intersection — dropped that guard).

5. Updates docs/api/chaisemartin_dhaultfoeuille.rst and
   docs/choosing_estimator.rst to replace "from the same fit" with
   "computed on the data you pass in (pre-filter)".

6. Adds three regression tests in TestTwowayFeweightsHelper:
   - test_twfe_pre_filter_contract_with_interior_gap_drop: panel with
     a dropped interior-gap group, asserts fitted twfe_* matches
     standalone, estimation sample is smaller, and the divergence
     warning fires with the expected counts.
   - test_twfe_pre_filter_contract_with_multi_switch_drop: panel with
     an injected multi-switch crosser, similar assertions.
   - test_twfe_no_divergence_warning_on_clean_panel: negative test
     asserting NO divergence warning fires on a clean panel
     (hard-codes pattern="single_switch" to close a future footgun).

7. Fixes the stale "Step 5a guarantees..." comment at line 712 to
   "Step 5b guarantees..." (post-Round-7 the ragged-panel validation
   is Step 5b, not Step 5a). Independent cleanup; bundled because
   it's in the same file and the same topic.
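The divergence warning described in item 4 can be sketched as follows. This is a minimal illustration, not the library's actual code: the function name, parameter names, and message wording are assumptions; only the contract (warn when the diagnostic was requested and any filter dropped groups, with explicit counts) comes from the change description above.

```python
import warnings


def warn_on_twfe_sample_divergence(twfe_diagnostic: bool,
                                   n_dropped_interior_gap: int,
                                   n_dropped_multi_switch: int):
    """Hypothetical sketch of the Step 6b divergence warning.

    Fires whenever the user requested the TWFE diagnostic AND any
    pre-estimation filter dropped groups, regardless of whether the
    diagnostic itself succeeded.
    """
    if not twfe_diagnostic:
        return None
    n_dropped = n_dropped_interior_gap + n_dropped_multi_switch
    if n_dropped == 0:
        return None
    msg = (
        f"TWFE diagnostic was computed on the full pre-filter cell sample, "
        f"but {n_dropped} group(s) were dropped before dCDH estimation "
        f"({n_dropped_interior_gap} interior-gap, "
        f"{n_dropped_multi_switch} multi-switch), so the diagnostic and "
        "estimation samples diverge. See the REGISTRY sample-contract Note."
    )
    warnings.warn(msg, UserWarning)
    return msg
```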

This resolution preserves Round 7's standalone-vs-fitted parity
(both APIs use the pre-filter cell sample) and addresses Round 9's
P1 about the documentation contract. Both reviewers' concerns are
now satisfied: the standalone and fitted paths produce identical numbers
on the same input, AND users see an explicit warning when filters
make the fitted sample diverge from the dCDH estimation sample.

Test counts: 107 -> 110 (three new sample-contract regression
tests). Black, ruff clean.

Files modified:
- docs/methodology/REGISTRY.md
  (new TWFE sample contract Note enumerating all three filters)
- diff_diff/chaisemartin_dhaultfoeuille.py
  (twfe_diagnostic param docstring, n_groups_dropped_interior_gap
  tracking, divergence warning at Step 6b, stale comment fix)
- diff_diff/chaisemartin_dhaultfoeuille_results.py
  (twfe_weights / twfe_fraction_negative / twfe_sigma_fe /
  twfe_beta_fe field docstrings)
- docs/api/chaisemartin_dhaultfoeuille.rst (wording fix)
- docs/choosing_estimator.rst (wording fix)
- tests/test_chaisemartin_dhaultfoeuille.py (3 new tests + 1
  parity test comment update)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 63edb3d9041ef37a53b58a34b6de32539ecfb2bd


Overall Assessment

Needs changes. One unmitigated P1 finding.

Executive Summary

Methodology

Code Quality

No unmitigated findings beyond the bootstrap inference-surface issue above.

Performance

No separate performance findings from static review.

Maintainability

No separate maintainability findings from the changed code.

Tech Debt

No separate deferrable finding. The bootstrap inference-surface mismatch is not tracked in TODO.md, so it remains unmitigated. See TODO.md:L51-L110.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the new bootstrap tests only verify that bootstrap_results is populated and that placebo bootstrap stays unavailable; they do not assert that the top-level public fields (overall_p_value, overall_conf_int, joiner/leaver analogues, event_study_effects[1]) or user-facing renderers (summary(), to_dataframe()) actually reflect bootstrap inference. That gap is why the P1 above would pass CI. Separately, the README currently describes the top-level overall_* fields as analytical, which conflicts with the runtime behavior once bootstrap is requested. Concrete fix: add a regression that compares the top-level inference fields and rendered summaries/tables against bootstrap_results when n_bootstrap > 0, and update the README / API docs / result docstrings to state exactly which fields switch to bootstrap inference. Refs: tests/test_chaisemartin_dhaultfoeuille.py:L867-L955, README.md:L1204-L1214, docs/api/chaisemartin_dhaultfoeuille.rst:L196-L206.

Path to Approval

  1. In diff_diff/chaisemartin_dhaultfoeuille.py:L1108-L1122, propagate bootstrap p-values and CIs from br into the top-level overall_*, joiners_*, and leavers_* result fields instead of recomputing them with safe_inference().
  2. Add a regression test that fits with n_bootstrap > 0 and asserts the public inference surface (overall_*, joiners_*, leavers_*, event_study_effects[1]) matches bootstrap_results.
  3. Add user-facing coverage that summary() and to_dataframe() surface the corrected bootstrap inference, and update the dCDH docs so the semantics of the top-level fields are explicit when bootstrap is enabled.

Fixes a P1 where the dCDH bootstrap branch silently replaced the
multiplier-bootstrap percentile p-values and CIs with normal-theory
recomputations from the bootstrap SE. The bootstrap helper computes
the per-target SE / CI / p-value triples correctly on the
DCDHBootstrapResults object, but fit() was only copying the SE and
then calling safe_inference() which returns normal-theory p-values
and CIs. The user requested multiplier bootstrap inference and got
a hybrid (bootstrap SE + normal-theory p/CI) — the public surface
fields, event_study_effects[1], summary(), to_dataframe(),
is_significant, and significance_stars all reflected the
normal-theory recomputations instead of the bootstrap inference.

The fix: in the fit() bootstrap branch, propagate br.overall_p_value
and br.overall_ci directly to the top-level overall_p / overall_ci
(and joiners/leavers analogues). Keep the t-stat from
safe_inference()[0] since percentile bootstrap doesn't define an
alternative t-stat semantic, and that satisfies the project
anti-pattern rule (never compute t = effect/se inline).
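The propagation contract can be sketched in miniature. The dataclass and function names here are stand-ins, not the library's real `DCDHBootstrapResults` or `fit()` internals; the sketch only illustrates the rule stated above: percentile p-value and CI come from the bootstrap object, while the t-stat stays SE-derived.

```python
from dataclasses import dataclass


@dataclass
class BootstrapResults:
    """Hypothetical stand-in for the bootstrap payload."""
    overall_se: float
    overall_p_value: float   # percentile p-value
    overall_ci: tuple        # percentile CI


def propagate_bootstrap_inference(fields: dict, br: BootstrapResults,
                                  se_based_t_stat: float) -> dict:
    # Copy percentile inference straight from the bootstrap object instead
    # of recomputing normal-theory p/CI from the bootstrap SE.
    fields["overall_se"] = br.overall_se
    fields["overall_p_value"] = br.overall_p_value
    fields["overall_conf_int"] = br.overall_ci
    # Only the t-stat stays SE-derived: percentile bootstrap defines no
    # alternative t-stat semantic.
    fields["overall_t_stat"] = se_based_t_stat
    return fields
```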

Library precedent: imputation.py:790-805, two_stage.py:778-787, and
efficient_did.py:1009-1013 all propagate bootstrap p/CI to the
public surface while keeping a SE-derived t-stat. dCDH was the only
modern bootstrap-enabled estimator that didn't follow this pattern.

Documentation updates:
- New `**Note (bootstrap inference surface):**` block in REGISTRY.md
  documenting the propagation contract, the rationale for the
  SE-based t-stat, and the placebo carve-out (placebo bootstrap
  remains deferred to Phase 2).
- Inference-method switch paragraph added to the
  ChaisemartinDHaultfoeuilleResults class docstring `Notes` section.
- README.md row updated to clarify "cohort-recentered analytical SE
  by default; multiplier-bootstrap percentile inference when
  n_bootstrap > 0".
- API rst Multiplier bootstrap example now shows that
  results.overall_p_value and results.overall_conf_int reflect the
  bootstrap inference (not just bootstrap_results.overall_*).

Regression test added: test_bootstrap_p_value_and_ci_propagated_to_top_level
asserts results.overall_p_value == bootstrap_results.overall_p_value,
overall_conf_int == bootstrap_results.overall_ci, the joiners/leavers
analogues, event_study_effects[1] reflects bootstrap, the t-stat is
the SE-derived value, summary() doesn't crash, and to_dataframe()
returns the bootstrap-derived numbers. This pins the contract so
the silent inference-method swap can't regress.

Test counts: 110 -> 111 (one new bootstrap propagation regression
test). The existing TestBootstrap tests at lines 855-955 only
asserted that bootstrap_results was populated and SE was finite,
which is why the silent swap passed CI for nine commits.

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py
  (fit() bootstrap branch propagates p/CI from br instead of
  recomputing via safe_inference)
- diff_diff/chaisemartin_dhaultfoeuille_results.py
  (class docstring `Notes` section gains the inference-method
  switch paragraph)
- docs/methodology/REGISTRY.md
  (new bootstrap inference surface Note)
- README.md (field-description row clarification)
- docs/api/chaisemartin_dhaultfoeuille.rst
  (Multiplier bootstrap example reflects the new propagation)
- tests/test_chaisemartin_dhaultfoeuille.py
  (test_bootstrap_p_value_and_ci_propagated_to_top_level)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:

P1 — Within-cell-varying treatment now raises ValueError instead of
silently rounding to majority. Phase 1 dCDH requires binary treatment
to be constant within each (group, time) cell; fractional d_gt values
(from individual-level data where some units in a cell are treated and
others are not) indicate a fuzzy design that Phase 1 does not support.
The previous behavior (UserWarning + majority-round) silently mutated
switcher/control membership before Theorem 3 arithmetic, changing the
estimand without the user's knowledge. The ValueError lists the
affected cells and points users at pre-aggregation. The "binary fuzzy
designs" claim has been removed from README, CHANGELOG, REGISTRY, and
choosing_estimator.rst. Both fit() and twowayfeweights() share the
same _validate_and_aggregate_to_cells() rejection via the existing
shared helper.
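A stdlib sketch of the rejection, assuming a hypothetical helper name and error wording (the real check lives in `_validate_and_aggregate_to_cells()` and operates on a DataFrame; this version works on plain tuples purely for illustration):

```python
from collections import defaultdict


def validate_cell_treatment(rows):
    """Hypothetical sketch: Phase 1 dCDH requires binary treatment to be
    constant within each (group, time) cell; mixed cells indicate a fuzzy
    design that Phase 1 does not support.

    rows: iterable of (group, time, treatment) with 0/1 treatment.
    """
    cells = defaultdict(set)
    for g, t, d in rows:
        cells[(g, t)].add(d)
    bad = sorted(cell for cell, values in cells.items() if len(values) > 1)
    if bad:
        raise ValueError(
            f"Treatment varies within (group, time) cell(s) {bad}; "
            "Phase 1 dCDH requires a sharp design. "
            "Pre-aggregate units to cells before fitting."
        )
```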

Tests:
- test_twowayfeweights_warns_on_within_cell_rounding renamed to
  test_twowayfeweights_rejects_within_cell_varying_treatment (now
  asserts ValueError instead of UserWarning)
- test_fit_rejects_within_cell_varying_treatment added (same panel
  via the fit() entry point)

CI fix — test_recovery_joiners_only_n200 was failing on arm64 with
seed 43 (assert 1.5 <= 1.276 — CI coverage assertion failed). Changed
to a point-estimate proximity assertion (abs(overall_att - 1.5) < 0.5)
which is stable across architectures and seeds. CI coverage checks are
inherently stochastic and require many replications to be reliable;
point-estimate proximity is the right assertion for a single-seed
large-N recovery test.

P3 — Fixed stale comment at line 1039 that said "in Phase 1 we
approximate [placebo SE] using the same plug-in formula" when the
actual behavior is intentionally NaN. Updated to match the warning
text and the REGISTRY placebo SE Note.

Test counts: 111 -> 112. Black, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a8b161c5fa1e713c331ec7c77690336c9df40b1e


Overall Assessment

⚠️ Needs changes.

Highest unmitigated severity is P1. This re-review is based on static inspection of the changed files; I could not run the new dCDH tests on this runner because pytest, numpy, and pandas are not installed.

Executive Summary

Methodology

  • Severity P3 informational. Impact: I did not find an unmitigated source-material mismatch in the changed estimator math. The previously raised mixed-cell-treatment issue is fixed, and the remaining substantive Phase 1 departures I checked, including placebo SE deferral and the balanced-baseline/no-interior-gap panel restriction, are explicitly documented in REGISTRY.md, so they are not defects under the review rubric. At the paper level, the cited source material is consistent with the feature’s intended scope: the AER paper motivates an alternative to potentially negatively weighted TWFE, and the dynamic companion paper explicitly studies non-absorbing, lagged-treatment settings. (aeaweb.org) Concrete fix: None. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:L201-L216, docs/methodology/REGISTRY.md:L470-L475, docs/methodology/REGISTRY.md:L555-L575

Code Quality

Performance

No findings from static review of the changed files.

Maintainability

Tech Debt

No separate finding. I did not find a dCDH-specific mitigation entry under TODO.md:L51-L97 that would downgrade the P1 above.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the changed dCDH tests pin adjacent public contracts like cluster gating and bootstrap-surface propagation, but I did not find a corresponding regression for fit(..., rank_deficient_action="error") on the fitted TWFE path, which is why the current P1 was able to slip through. Concrete fix: add a dCDH regression test with a deliberately rank-deficient FE design asserting that fit() raises under "error" and remains non-fatal only under "warn"/"silent". Refs: tests/test_chaisemartin_dhaultfoeuille.py:L330-L371, tests/test_chaisemartin_dhaultfoeuille.py:L974-L1035
  • I could not execute the dCDH tests on this runner because pytest, numpy, and pandas are unavailable.

Path to Approval

  1. Honor rank_deficient_action="error" in ChaisemartinDHaultfoeuille.fit() by re-raising rank-deficiency failures from _compute_twfe_diagnostic() instead of downgrading them to warning + skip.
  2. Add a regression test that constructs a rank-deficient TWFE diagnostic design and asserts fit(..., twfe_diagnostic=True, rank_deficient_action="error") raises, while "warn"/"silent" preserve the intended non-fatal behavior.
  3. Update the stale dCDH docstrings so they match the current Phase 1 contract: fractional within-cell treatment raises ValueError; it is not rounded.

@igerber removed the ready-for-ci (Triggers CI test workflows) label Apr 12, 2026
P1: the blanket except Exception around the TWFE diagnostic call in
fit() Step 5a swallowed ALL exceptions, including the ValueError that
solve_ols raises when rank_deficient_action="error". This broke the
public-parameter contract: a user requesting strict failure on a
rank-deficient TWFE design would get a successful fit with the
diagnostic silently omitted instead. The standalone twowayfeweights()
already honored the parameter correctly; only the fitted path was
broken.

Fix: in the except block, re-raise ValueError when
rank_deficient_action=="error" so the user's strict-failure request
is honored. Other exceptions (genuinely non-fatal diagnostic failures)
are still downgraded to a warning + skipped diagnostic.
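The shape of the fixed except block, as a hedged sketch (the `compute` callable stands in for the internal `_compute_twfe_diagnostic()` call; the warning text is illustrative):

```python
import warnings


def run_twfe_diagnostic(compute, rank_deficient_action: str = "warn"):
    """Sketch of the fixed Step 5a handler: honor the user's strict-failure
    request while keeping other diagnostic failures non-fatal."""
    try:
        return compute()
    except ValueError:
        if rank_deficient_action == "error":
            raise  # user asked for strict failure on rank deficiency
        warnings.warn("TWFE diagnostic failed and was skipped.", UserWarning)
        return None
    except Exception:
        # Genuinely non-fatal diagnostic failures stay warning + skip.
        warnings.warn("TWFE diagnostic failed and was skipped.", UserWarning)
        return None
```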

P3: fixed two stale docstrings that still described the old "warning
+ majority rounding" behavior for fractional within-cell treatment
(the _validate_and_aggregate_to_cells step 5 docstring and the
fit() group parameter docstring). Both now correctly describe the
Phase 1 ValueError rejection.

Regression test: test_rank_deficient_action_error_raises_on_fitted_twfe
in TestForwardCompatGates. Uses a minimal 1-group 2-period panel
(2 cells < 3 FE columns) to trigger the underdetermined-system
ValueError. Asserts rank_deficient_action="error" raises, and
rank_deficient_action="warn" does NOT raise the TWFE-specific error.

Test counts: 112 -> 113. Black, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b5e184781eeffd3391783549fee8ec3b23abcf90


Overall Assessment
⚠️ Needs changes.

Highest unmitigated severity is P1. This re-review resolves the prior fitted-TWFE rank_deficient_action="error" blocker, but two P1 issues remain: one methodology-surface deviation that is not documented as such, and one missing empty-input guard on the new standalone diagnostic.

Executive Summary

Methodology

Code Quality

  • Severity P1. Impact: twowayfeweights() validates columns and aggregates to cells, but it does not guard the empty-frame case before calling _compute_twfe_diagnostic(). Statically, an empty cell frame drives _build_group_time_design() to n_groups == 0, n_times == 0, and therefore n_cols = -1, which can surface as a low-level negative-dimensions error instead of the clear period-count validation that fit() emits later. This is exactly the missing empty-result handling the review checklist calls out for new code paths. Concrete fix: reject cell.empty (or n_groups == 0 or n_times == 0) in twowayfeweights() / _compute_twfe_diagnostic() with a clear ValueError, and keep the message aligned with fit()’s explicit “at least 2 distinct time periods” contract. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:L623-L628, diff_diff/chaisemartin_dhaultfoeuille.py:L1959-L1967, diff_diff/chaisemartin_dhaultfoeuille.py:L2151-L2200

Performance
No findings from static review of the changed files.

Maintainability
No separate findings beyond the methodology-surface documentation issue above.

Tech Debt
No separate findings. Neither blocking issue is tracked under TODO.md:L51-L97, so neither is mitigated by the deferred-work policy.

Security
No findings.

Documentation/Tests

  • Severity P3. Impact: the shared-validation tests for twowayfeweights() cover NaN treatment, NaN outcome, non-binary treatment, and within-cell-varying treatment, but not the empty-frame case, which is why the P1 above is currently unpinned. Concrete fix: add a regression that passes an empty DataFrame with group/period/treatment/outcome columns to twowayfeweights() and asserts a clear ValueError. Refs: tests/test_chaisemartin_dhaultfoeuille.py:L889-L900, tests/test_chaisemartin_dhaultfoeuille.py:L1558-L1647
  • I could not execute the new dCDH tests on this runner because numpy, pandas, and pytest are not installed.

Path to Approval

  1. Resolve the raw-microdata weighting contract. Either:
    • change internal aggregation so uneven group × time cell sizes are weighted the way did_multiplegt_dyn documents for more-disaggregated input, or
    • if the equal-cell rule is intentional, add a labeled **Note (deviation from R DIDmultiplegtDYN):** to docs/methodology/REGISTRY.md, narrow the “matches R at l=1” language in docs/api/chaisemartin_dhaultfoeuille.rst, and add a parity/regression test that exercises unequal cell sizes explicitly.
  2. Add explicit empty-input validation to twowayfeweights() (or _compute_twfe_diagnostic()), so an empty DataFrame with the required columns raises a clear ValueError, and add a regression test for that case.

Two P1 fixes from CI review:

P1 — Cell-count weighting deviation from R now has a labeled
`**Note (deviation from R DIDmultiplegtDYN):**` in REGISTRY.md.
The equal-cell weighting is paper-literal (AER 2020 Theorem 3
defines N_{a,b,t} as cell counts), but R DIDmultiplegtDYN weights
by within-cell observation count on individual-level input. The
two agree on cell-aggregated input where every cell has the same
number of observations (which is how the parity tests are built).
Narrowed the API rst parity language to "on cell-aggregated input."

P1 — twowayfeweights() / _build_group_time_design() now guards
against empty cell DataFrames with a clear ValueError instead of
falling into a low-level negative-dimensions error from n_groups=0.
Added test_twowayfeweights_rejects_empty_input.
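A minimal sketch of the guard, assuming a hypothetical helper name; the message wording follows the "at least 2 distinct time periods" contract the review cites for fit():

```python
def check_nonempty_cells(groups, times):
    """Sketch: reject empty or degenerate cell frames before building the
    group x time design, instead of letting n_cols go negative."""
    n_groups = len(set(groups))
    n_times = len(set(times))
    if n_groups == 0 or n_times < 2:
        raise ValueError(
            "twowayfeweights() requires a non-empty panel with at least "
            f"2 distinct time periods (got {n_groups} group(s), "
            f"{n_times} period(s))."
        )
```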

Test counts: 113 -> 114. Black, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 81871d4433873b6ffd539b60f95a4837e49753b9


Overall Assessment

⚠️ Needs changes.

Highest unmitigated severity is P1: the new dCDH estimator path still lacks a guard or documented contract for globally missing time periods, so it can silently reinterpret calendar gaps as adjacent periods.

Executive Summary

Methodology

Code Quality

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: the P1 above is not tracked under the code-review tech-debt table in TODO.md, so the deferred-work policy does not mitigate it. Ref: TODO.md:L51.
    Concrete fix: address the issue in this PR; TODO tracking would not downgrade a P1 missing-assumption-check problem.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Add fit() validation that rejects or explicitly warns on nonconsecutive global periods for numeric/datetime time inputs, instead of silently treating adjacent observed periods as adjacent calendar periods.
  2. Update the dCDH docs to state the even-spacing requirement clearly, alongside the existing ragged-panel note, so users can distinguish supported terminal missingness from unsupported globally missing periods.
  3. Add a regression test that constructs a panel with a shared global gap and asserts the chosen ValueError/warning behavior.

P1: the estimator operates on sorted period indices (matching the AER
2020 paper and R DIDmultiplegtDYN) but this was not documented. A
panel with periods [2000, 2001, 2003] treats 2003 as the immediate
successor of 2001 without warning. Added a **Period-index semantics**
assumption check bullet in REGISTRY.md explaining that the estimator
does not validate or impute globally missing calendar periods, and
advising users with calendar gaps to impute missing periods before
fitting if they want period-over-period differences rather than
gap-spanning differences.
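Users who want to follow that advice first need to find the gaps. A stdlib sketch for integer calendar periods (the function name is illustrative, not part of the library):

```python
def find_global_period_gaps(periods):
    """Sketch: detect globally missing integer calendar periods, which the
    estimator would otherwise treat as adjacent (e.g. 2001 -> 2003)."""
    observed = sorted(set(periods))
    observed_set = set(observed)
    full = range(observed[0], observed[-1] + 1)
    return [p for p in full if p not in observed_set]
```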

Regression test: test_global_period_gap_treated_as_adjacent constructs
a 4-group panel with periods [0, 1, 3] (global gap at period 2) and
asserts the fit completes with all 4 groups, a finite point estimate,
and populated per-period effects including the gap-spanning transition.

Test counts: 114 -> 115. Black, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8956d138e5ac219b6b14e6830e5d6ec503480e5c


Overall Assessment

⛔ Blocker. Highest unmitigated severity is P0.

Executive Summary

Methodology

Code Quality

No additional findings beyond the shared-validation defect above.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: TODO.md does not track the missing-key drop, and even if it did, this rubric does not allow TODO tracking to mitigate a P0 silent-correctness bug. References: TODO.md:L51-L77. Concrete fix: address this in the PR rather than deferring it.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the new registry note on global calendar gaps recommends inserting rows with NaN outcomes and letting cell aggregation handle them, but the same implementation explicitly raises on any NaN outcome. That documented workaround is therefore not currently usable. References: docs/methodology/REGISTRY.md:L472-L478, diff_diff/chaisemartin_dhaultfoeuille.py:L169-L181. Concrete fix: either support a missing-period placeholder path consistent with that note, or revise the note to recommend a preprocessing route that the current validator actually accepts.
  • Severity P3. Impact: the validation-test block says it is pinning the shared helper against “drop NaN rows in groupby,” but the actual cases only cover NaN treatment, NaN outcome, non-binary treatment, empty input, and within-cell treatment variation. There is still no regression for NaN group or NaN time, which is the exact path behind the P0 above. References: tests/test_chaisemartin_dhaultfoeuille.py:L1612-L1666, diff_diff/chaisemartin_dhaultfoeuille.py:L153-L199. Concrete fix: add explicit regressions for missing group and missing time on both fit() and twowayfeweights(), and assert the new error contract.

Review limitation: I could not run the tests here because numpy, pandas, and pytest are unavailable in this runner.

Path to Approval

  1. Add explicit missing-value validation for group and time in _validate_and_aggregate_to_cells() before the groupby, with ValueError messages parallel to the existing treatment / outcome guards.
  2. Add regressions covering NaN group and NaN time for both public entry points: ChaisemartinDHaultfoeuille.fit() and twowayfeweights().
  3. Fix the registry note on global calendar gaps so its recommended workaround matches actual validator behavior, or implement the missing-period placeholder path it currently describes.

P0: groupby silently drops NaN keys, so NaN in the group or time
columns would silently shrink the estimation sample without warning.
Added explicit NaN checks for both columns in
_validate_and_aggregate_to_cells() before the groupby call, parallel
to the existing treatment/outcome NaN validators. Both fit() and
twowayfeweights() share the validation.
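The guard can be sketched with the stdlib (helper name and message wording are assumptions; the real check runs on DataFrame columns before the pandas groupby, which silently drops NaN keys by default):

```python
import math


def check_no_missing_keys(groups, times):
    """Sketch of the pre-groupby guard: reject missing group/time values
    explicitly instead of letting the cell aggregation drop them."""
    for name, values in (("group", groups), ("time", times)):
        n_nan = sum(1 for v in values
                    if isinstance(v, float) and math.isnan(v))
        if n_nan:
            raise ValueError(
                f"{name} column contains {n_nan} missing value(s); these "
                "rows would be silently dropped by the cell aggregation."
            )
```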

Four regression tests added: test_twowayfeweights_rejects_nan_group,
test_twowayfeweights_rejects_nan_time, test_fit_rejects_nan_group,
test_fit_rejects_nan_time.

Also fixed the REGISTRY period-index-semantics workaround text that
previously suggested inserting rows with NaN outcome for missing
calendar periods. That would fail the outcome-NaN validator. Updated
to recommend finite imputed outcomes instead.

Test counts: 115 -> 119. Black, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 639f5dcc90b26cc8dad00d3c08b4257092188907


Overall Assessment

⚠️ Needs changes — highest unmitigated severity is P1.

Executive Summary

Methodology

The main estimator path is in reasonable shape for a re-review: AER 2020’s sharp-design DID_{+,t} / DID_{-,t} / aggregate W^TC definitions and the dynamic paper’s DID_ℓ / cohort-recentered CI machinery are the right source anchors for this Phase 1 implementation. The meaningful departures I spot-checked are registry-noted rather than silent. (nber.org)

  • Severity P1. Impact: twowayfeweights() and fitted results.twfe_sigma_fe are presented as the Corollary 1 sign-flip threshold, but `_compute_twfe_diagnostic()` (diff_diff/chaisemartin_dhaultfoeuille.py:L2018) computes `sigma_fe` from the already-multiplied contribution weights `N_{g,t} ε_{g,t} / Σ N_{g,t} ε_{g,t}` via `sqrt(sum_treated w^2 - mean(w)^2 * n_treated)` at diff_diff/chaisemartin_dhaultfoeuille.py:L2025-L2100. AER Theorem 1 / Corollary 1 instead define `β_fe` using `(N_{g,t}/N_1) w_{g,t}` but define `σ(w)` from the paper's `w_{g,t}` itself, weighted by `N_{g,t}/N_1`. Those are different objects; even on balanced panels the current denominator is smaller by a `sqrt(n_treated)` factor, so `sigma_fe` is overstated and makes TWFE look more robust to heterogeneity than the paper supports. Concrete fix: compute the paper's `w_{g,t}` explicitly (or recover it from the contribution weights by dividing by `N_{g,t}/N_1`), then compute `sigma_fe = abs(beta_fe) / sqrt(sum((N_{g,t}/N_1) * (w_{g,t} - 1)^2))`, and keep the stored per-cell contribution weights separate if you still want to expose them. (nber.org)
  • Severity P3. Impact: the estimator intentionally equal-weights aggregated (group,time) cells after aggregation, and that behavior is explicit in code and tests at diff_diff/chaisemartin_dhaultfoeuille.py:L1425-L1448, docs/methodology/REGISTRY.md:L493-L509, and tests/test_chaisemartin_dhaultfoeuille.py:L733-L813. Under this rubric that makes it informational, not blocking. The issue is that the registry/test language currently calls this “paper-literal,” while the AER paper defines N_{d,d',t} and N_{d,d',d'',t} as sums of N_{g,t} observation counts. Concrete fix: reword the note/test to describe equal-cell weighting as an intentional library deviation on uneven cell sizes, not as literal paper fidelity. (nber.org)

Code Quality

No additional findings. The main re-review regression from the prior AI review is fixed in diff_diff/chaisemartin_dhaultfoeuille.py:L153-L169.

Performance

No findings.

Maintainability

No findings beyond the test gap noted below.

Tech Debt

No mitigating TODO.md entry covers the P1 sigma_fe issue, and under this rubric it would not downgrade that finding anyway.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Fix sigma_fe so twowayfeweights() and fitted results.twfe_sigma_fe use the AER Corollary 1 w_{g,t} / N_{g,t}/N_1 formula, rather than the current contribution-weight surrogate.
  2. Add a regression test with a hand-checkable TWFE example that asserts the reported per-cell weights and sigma_fe numerically; the current tests only check that the fields are present.

…itation

Three fixes from CI review:

P1 — sigma_fe now uses observation-share-weighted variance of the
per-cell TWFE weights, matching AER Corollary 1. The previous code
computed an unweighted variance across treated cells, which
understates sigma_fe (reports a threshold that's too optimistic).
The fix uses shares s_{g,t} = N_{g,t} / N_1 where N_1 is total
treated observations, then computes:
  var_w = sum(s * (w - w_bar)^2)
  sigma_fe = |beta_fe| / sqrt(var_w)
On equal-cell-size panels (the common case and the parity-test
DGPs), the two formulas produce identical results.
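The Round-17 formula above can be written out directly (a sketch with an illustrative function name; shares s_{g,t} = N_{g,t} / N_1 are assumed to sum to 1):

```python
def sigma_fe_share_weighted(weights, shares, beta_fe):
    """Sketch of the Round-17 fix: observation-share-weighted variance of
    the per-cell TWFE weights, then sigma_fe = |beta_fe| / sqrt(var_w)."""
    w_bar = sum(s * w for s, w in zip(shares, weights))
    var_w = sum(s * (w - w_bar) ** 2 for s, w in zip(shares, weights))
    return abs(beta_fe) / var_w ** 0.5
```

With equal shares this collapses to the unweighted formula, which is why equal-cell-size panels were unaffected.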

P3 — Softened "paper-literal" wording for cell-count weighting in
REGISTRY.md. The AER paper's N_{a,b,t} notation can be read as
observation sums or cell counts. The equal-cell interpretation is
the one we implement, and the deviation from R is already labeled.

P3 — Fixed placebo citation from "Theorem 4 of AER 2020" to the
correct reference. The reviewer (citing the published AER paper and
NBER WP) notes that the placebo specification is part of the same
section as the DID_M definition, not a separate Theorem 4 (which
covers fuzzy designs). Removed the incorrect Theorem 4 number from
all 8 affected surfaces (source code, results dataclass, REGISTRY,
CHANGELOG, ROADMAP, API rst, llms-full.txt, test docstring).

119 tests still pass. Black, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 62b0f93777e232c2c0c673e5b4b218ca2f97de3d


Overall Assessment

⚠️ Needs changes — highest unmitigated severity is P1.

This re-review is down to one remaining blocker: the TWFE diagnostic still does not compute the Corollary 1 sigma_fe quantity it claims to expose in diff_diff/chaisemartin_dhaultfoeuille.py:L2049-L2111.

Executive Summary

Methodology

  • Severity P1
    Impact: _compute_twfe_diagnostic() in diff_diff/chaisemartin_dhaultfoeuille.py:L2049-L2111 computes w_gt = (N_{g,t} * eps_{g,t}) / denom and then uses the treated-share-weighted variance of that object around its weighted mean for sigma_fe. But Theorem 1 / Corollary 1 separate the treated-cell contribution weights used in the TWFE decomposition from the paper’s w_{g,t} object used inside sigma(w), which is centered at 1 under treated-cell weighting. So results.twfe_sigma_fe and twowayfeweights().sigma_fe are not the sign-flip threshold the code and docs claim to expose. I did not find a mitigating Note/Deviation for this in docs/methodology/REGISTRY.md:L566-L588 or a tracking entry in TODO.md:L51-L77.
    Concrete fix: keep the exported per-cell decomposition weights if desired, but compute sigma_fe from the paper’s w_{g,t} object. Concretely, use treated observation shares s_{g,t} = N_{g,t} / N_1, form w_paper = eps / sum_treated(s * eps) on treated cells (equivalently w_paper = contribution_weight / s), then compute sigma_fe = abs(beta_fe) / sqrt(sum_treated(s * (w_paper - 1)**2)). If the public weights column remains the contribution-weight object, rename or document it accordingly so it is not conflated with Corollary 1’s w_{g,t}.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No deferrable findings. TODO.md does not track the sigma_fe issue, so it remains unmitigated under the stated rubric.

Security

No findings.

Documentation/Tests

  • Severity P2
    Impact: the TWFE methodology test in tests/test_methodology_chaisemartin_dhaultfoeuille.py:L420-L446 only checks that beta_fe, fraction_negative, sigma_fe, and weights are present/nonzero. That would not catch the current sigma_fe formula mismatch, nor any future confusion between exposed contribution weights and the paper’s w_{g,t} object.
    Concrete fix: add a small hand-checkable TWFE panel and assert beta_fe, the exposed weight column, and sigma_fe numerically. The regression should compute the expected value from the paper formula directly, not just check non-nullness.

Path to Approval

  1. Fix sigma_fe so the diagnostic uses the Corollary 1 w_{g,t} / treated-share formula, rather than the current contribution-weight surrogate.
  2. Add a numeric regression test for the TWFE diagnostic that would fail on the current implementation.
  3. Update the TWFE diagnostic docstring/API wording if the exported weights field continues to represent contribution weights rather than Corollary 1’s w_{g,t}.

P1: Round 17's sigma_fe fix still used the contribution weights
w_gt = s * w_paper as if they were the paper's w_{g,t} object. The
paper's Corollary 1 defines w_{g,t} = eps / E_treated[eps], which
does NOT include the observation-share factor in the numerator. The
contribution weights (which include the share factor) are the right
object for the TWFE decomposition itself, but sigma_fe needs the
paper's w_{g,t} centered at 1.

The fix computes w_paper separately:
  w_paper = eps_treated / sum(s * eps_treated)
  sigma(w) = sqrt(sum(s * (w_paper - 1)^2))
  sigma_fe = |beta_fe| / sigma(w)

The exported weights column still contains the contribution weights
(the TWFE decomposition object), not the paper weights.

P2: added test_twfe_diagnostic_hand_checkable_sigma_fe in
TestTWFEDiagnostic with a 4-group 3-period staggered panel.
Asserts beta_fe = 3.5, fraction_negative = 0.0, and
sigma_fe = 6.8641 numerically. This test would have caught both
the Round 17 and pre-Round-17 sigma_fe formula errors.

Test counts: 119 -> 120. Black, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: dd52d27c747d1c8372ee19bc76518234459ad5df


Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings in the changed dCDH estimator code. The prior sigma_fe methodology blocker appears resolved in diff_diff/chaisemartin_dhaultfoeuille.py:L2092 and is now pinned by a numeric regression in tests/test_methodology_chaisemartin_dhaultfoeuille.py:L448.

Executive Summary

  • The previous P1 on TWFE sigma_fe looks fixed: the implementation now distinguishes exported contribution weights from the Corollary 1 paper weights used inside sigma_fe, and computes the latter from treated-cell shares and residuals in diff_diff/chaisemartin_dhaultfoeuille.py:L2092.
  • The new hand-checkable TWFE test would have failed under the prior formula, so the fix is now regression-covered in tests/test_methodology_chaisemartin_dhaultfoeuille.py:L448.
  • I did not find a new unmitigated methodology defect in the changed estimator, placebo, bootstrap, A11 handling, or parameter-gating paths.
  • Bootstrap surface propagation is internally consistent now: top-level p-values/CIs are taken from the bootstrap results, while t-stats still come through safe_inference() in diff_diff/chaisemartin_dhaultfoeuille.py:L1133.
  • One remaining issue is documentation-only: the helper docstring and user docs still blur the distinction between exported TWFE decomposition weights and the paper’s Corollary 1 w_{g,t} object used for sigma_fe.
  • I could not run tests in this runner because pytest is not installed.

Methodology

No findings. The re-review blocker on sigma_fe appears addressed by the Corollary 1 implementation in diff_diff/chaisemartin_dhaultfoeuille.py:L2092 and the new numeric test in tests/test_methodology_chaisemartin_dhaultfoeuille.py:L448.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not see a new correctness limitation that needs TODO.md tracking.

Security

No findings.

Documentation/Tests

  • Severity P3
    Impact: _compute_twfe_diagnostic() still documents the old sigma_fe formula in terms of the exported w, even though the code now correctly uses the Corollary 1 paper weights derived from treated-share residuals; the README/API text also says only “per-cell weights,” which leaves the two weight objects easy to conflate. See diff_diff/chaisemartin_dhaultfoeuille.py:L2018, README.md:L1212, and docs/api/chaisemartin_dhaultfoeuille.rst:L148.
    Concrete fix: update the helper docstring and user-facing docs to state explicitly that weights/twfe_weights are Theorem 1 contribution weights, while sigma_fe is computed from the Corollary 1 paper w_{g,t} object built internally from treated observation shares; renaming the exported column to contribution_weight would remove the ambiguity entirely.

P3: the helper docstring still described the old sigma_fe formula
in terms of the exported contribution weights. Updated to explicitly
distinguish the two weight objects:
- Contribution weights (Theorem 1 decomposition, exported in the
  weights column): cw = N * eps / denom
- Paper weights (Corollary 1, used internally for sigma_fe):
  w_paper = eps / sum(s * eps), centered at 1
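
The share-factor link between the two weight objects can be checked numerically. The arrays below are made-up illustrations, not values from the library:

```python
import numpy as np

# Made-up treated-cell inputs, for illustration only
N_gt = np.array([40.0, 30.0, 30.0])   # treated (g, t) cell sizes
eps = np.array([0.8, -0.2, 0.6])      # epsilon_{g,t} on treated cells

s = N_gt / N_gt.sum()                 # observation shares s_{g,t} = N_{g,t} / N_1
cw = N_gt * eps / np.sum(N_gt * eps)  # Theorem 1 contribution weights (sum to 1)
w_paper = eps / np.sum(s * eps)       # Corollary 1 weights (share-weighted mean 1)

# The contribution weights equal s * w_paper cell by cell
print(np.allclose(cw, s * w_paper))  # → True
```

This is the identity cw = s * w_paper from the commit message: dividing both numerator and denominator of cw by N_1 converts cell sizes into shares, so the two objects differ exactly by the observation-share factor.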

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5b46f407c8da9983725820ee04f83e4ba651fdd7


Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings in the changed dCDH estimator PR.

Executive Summary

Methodology

No findings. Against the Methodology Registry baseline, I did not find an undocumented estimator, weighting, variance/SE, identification, or default-behavior mismatch in the changed dCDH code. The prior sigma_fe issue remains fixed in chaisemartin_dhaultfoeuille.py#L2100 and regression-covered in test_methodology_chaisemartin_dhaultfoeuille.py#L448.

Code Quality

No findings. New public parameters are propagated consistently across constructor/introspection/wrapper paths in chaisemartin_dhaultfoeuille.py#L356, chaisemartin_dhaultfoeuille.py#L413, and chaisemartin_dhaultfoeuille.py#L2180.

Performance

No findings. The changed estimator and bootstrap paths are vectorized over pre-aggregated matrices, and the bootstrap helper uses batched weight generation in chaisemartin_dhaultfoeuille_bootstrap.py#L290.
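
Batched multiplier-weight generation of the kind described can be sketched as follows. The function name and signature are illustrative, not the library's actual helper; the three distributions match the PR's Rademacher / Mammen / Webb options:

```python
import numpy as np

def multiplier_weights(kind, n_clusters, n_boot, rng):
    """Draw an (n_boot, n_clusters) matrix of multiplier-bootstrap weights
    in one batched call (illustrative sketch, not the library's helper)."""
    size = (n_boot, n_clusters)
    if kind == "rademacher":
        # +1 / -1 with equal probability
        return rng.choice([-1.0, 1.0], size=size)
    if kind == "mammen":
        # Two-point distribution with mean 0, variance 1, third moment 1
        golden = (1.0 + np.sqrt(5.0)) / 2.0
        p_low = golden / np.sqrt(5.0)  # P(w = 1 - golden) ≈ 0.7236
        u = rng.random(size)
        return np.where(u < p_low, 1.0 - golden, golden)
    if kind == "webb":
        # Six-point distribution: ±sqrt(k/2) for k in {1, 2, 3}, equal probability
        points = np.sqrt(np.array([0.5, 1.0, 1.5]))
        points = np.concatenate([-points, points])
        return rng.choice(points, size=size)
    raise ValueError(f"unknown multiplier weight kind: {kind!r}")
```

Generating all n_boot draws in one call is what makes the bootstrap loop cheap: each replication is then a weighted re-aggregation of the pre-computed cell matrices rather than a fresh pass over the panel.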

Maintainability

No findings. The shared validation/aggregation helper and explicit sample-contract warning reduce drift between fit() and twowayfeweights() in chaisemartin_dhaultfoeuille.py#L94 and chaisemartin_dhaultfoeuille.py#L749.

Tech Debt

No findings. I did not identify a new deferrable item that needs tracking in TODO.md.

Security

No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 12, 2026
The test_rank_deficient_action_error_raises_on_fitted_twfe test
failed on all CI runners because the Rust backend's solve_ols
handles underdetermined systems without raising ValueError (unlike
the Python backend). The test relied on solve_ols raising "Fewer
observations than parameters" for a 1-group panel, which is
backend-specific behavior.

Fix: added back the n_groups < 2 or n_times < 2 guard in
_build_group_time_design (removed in Round 14, needed again now).
This guard fires deterministically on both backends before
solve_ols is called. Updated the test to match the guard's error
message ("at least 2 groups") instead of the backend-specific
solve_ols message.

120 tests still pass locally. Black, ruff clean.
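
The guard described above can be sketched as follows; the function body and error message are illustrative, following the commit's description rather than the library's exact code:

```python
def build_group_time_design(groups, times):
    """Guarded design-matrix builder (illustrative, not the library's code).

    Raises before any OLS solve, so the error is deterministic on both the
    Python and Rust backends instead of depending on whether solve_ols
    errors on underdetermined systems."""
    n_groups = len(set(groups))
    n_times = len(set(times))
    if n_groups < 2 or n_times < 2:
        raise ValueError(
            f"TWFE decomposition requires at least 2 groups and 2 periods; "
            f"got {n_groups} group(s) and {n_times} period(s)."
        )
    # ... build the two-way fixed-effects dummy matrix here ...
    return n_groups, n_times
```

A 1-group panel then fails with the same "at least 2 groups" message everywhere, which is what the updated test asserts instead of the backend-specific solve_ols message.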

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber merged commit 5533a2b into main Apr 12, 2026
19 of 20 checks passed
@igerber igerber deleted the dcdh-estimator branch April 12, 2026 15:03
@igerber igerber mentioned this pull request Apr 12, 2026